aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-02-06btrfs: restrict snapshotting to own subvolumesDavid Sterba
commit d024206133ce21936b3d5780359afc00247655b7 upstream. Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06Btrfs: setup inode location during btrfs_init_inode_lockedChris Mason
commit 90d3e592e99b8e374ead2b45148abf506493a959 upstream. We have a race during inode init because the BTRFS_I(inode)->location is setup after the inode hash table lock is dropped. btrfs_find_actor uses the location field, so our search might not find an existing inode in the hash table if we race with the inode init code. This commit changes things to setup the location field sooner. Also the find actor now uses only the location objectid to match inodes. For inode hashing, we just need a unique and stable test, it doesn't have to reflect the inode numbers we show to userland. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot()Wang Shilong
commit 90515e7f5d7d24cbb2a4038a3f1b5cfa2921aa17 upstream. We may return early in btrfs_drop_snapshot(), we shouldn't call btrfs_std_err() for this case, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06hpfs: remember free spaceMikulas Patocka
commit 2cbe5c76fc5e38e9af4b709593146e4b8272b69e upstream. Previously, hpfs scanned all bitmaps each time the user asked for free space using statfs. This patch changes it so that hpfs scans the bitmaps only once, remembes the free space and on next invocation of statfs it returns the value instantly. New versions of wine are hammering on the statfs syscall very heavily, making some games unplayable when they're stored on hpfs, with load times in minutes. This should be backported to the stable kernels because it fixes user-visible problem (excessive level load times in wine). Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06vfs: Is mounted should be testing mnt_ns for NULL or error.Eric W. Biederman
commit 260a459d2e39761fbd39803497205ce1690bc7b1 upstream. A bug was introduced with the is_mounted helper function in commit f7a99c5b7c8bd3d3f533c8b38274e33f3da9096e Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sat Jun 9 00:59:08 2012 -0400 get rid of ->mnt_longterm it's enough to set ->mnt_ns of internal vfsmounts to something distinct from all struct mnt_namespace out there; then we can just use the check for ->mnt_ns != NULL in the fast path of mntput_no_expire() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> The intent was to test if the real_mount(vfsmount)->mnt_ns was NULL_OR_ERR but the code is actually testing real_mount(vfsmount) and always returning true. The result is d_absolute_path returning paths it should be hiding. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06vfs: Remove second variable named error in __dentry_pathEric W. Biederman
commit a8323da0366d3398eda62741d2ac1130c8a172ed upstream. In commit 232d2d60aa5469bb097f55728f65146bd49c1d25 Author: Waiman Long <Waiman.Long@hp.com> Date: Mon Sep 9 12:18:13 2013 -0400 dcache: Translating dentry into pathname without taking rename_lock The __dentry_path locking was changed and the variable error was intended to be moved outside of the loop. Unfortunately the inner declaration of error was not removed. Resulting in a version of __dentry_path that will never return an error. Remove the problematic inner declaration of error and allow __dentry_path to return errors once again. Cc: Waiman Long <Waiman.Long@hp.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06ext4: avoid clearing beyond i_blocks when truncating an inline data fileTheodore Ts'o
commit 09c455aaa8f47a94d5bafaa23d58365768210507 upstream. A missing cast means that when we are truncating a file which is less than 60 bytes, we don't clear the correct area of memory, and in fact we can end up truncating the next inode in the inode table, or worse yet, some other kernel data structure. Addresses-Coverity-Id: #751987 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-29GFS2: Increase i_writecount during gfs2_setattr_chownBob Peterson
commit 62e96cf81988101fe9e086b2877307b6adda5197 upstream. This patch calls get_write_access in function gfs2_setattr_chown, which merely increases inode->i_writecount for the duration of the function. That will ensure that any file closes won't delete the inode's multi-block reservation while the function is running. It also ensures that a multi-block reservation exists when needed for quota change operations during the chown. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "This is a set of 3 regression fixes. This fixes /proc/mounts when using "ip netns add <netns>" to display the actual mount point. This fixes a regression in clone that broke lxc-attach. This fixes a regression in the permission checks for mounting /proc that made proc unmountable if binfmt_misc was in use. Oops. My apologies for sending this pull request so late. Al Viro gave interesting review comments about the d_path fix that I wanted to address in detail before I sent this pull request. Unfortunately a bad round of colds kept from addressing that in detail until today. The executive summary of the review was: Al: Is patching d_path really sufficient? The prepend_path, d_path, d_absolute_path, and __d_path family of functions is a really mess. Me: Yes, patching d_path is really sufficient. Yes, the code is mess. No it is not appropriate to rewrite all of d_path for a regression that has existed for entirely too long already, when a two line change will do" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Fix a regression in mounting proc fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) vfs: In d_path don't call d_dname on a mount point
2014-01-16Merge tag 'writeback-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "Fix data corruption on NFS writeback. It has been in linux-next for one month" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Fix data corruption on NFS
2014-01-15nilfs2: fix segctor bug that causes file system corruptionAndreas Rohner
There is a bug in the function nilfs_segctor_collect, which results in active data being written to a segment, that is marked as clean. It is possible, that this segment is selected for a later segment construction, whereby the old data is overwritten. The problem shows itself with the following kernel log message: nilfs_sufile_do_cancel_free: segment 6533 must be clean Usually a few hours later the file system gets corrupted: NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0 NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660) The issue can be reproduced with a file system that is nearly full and with the cleaner running, while some IO intensive task is running. Although it is quite hard to reproduce. This is what happens: 1. The cleaner starts the segment construction 2. nilfs_segctor_collect is called 3. sc_stage is on NILFS_ST_SUFILE and segments are freed 4. sc_stage is on NILFS_ST_DAT current segment is full 5. nilfs_segctor_extend_segments is called, which allocates a new segment 6. The new segment is one of the segments freed in step 3 7. nilfs_sufile_cancel_freev is called and produces an error message 8. Loop around and the collection starts again 9. sc_stage is on NILFS_ST_SUFILE and segments are freed including the newly allocated segment, which will contain active data and can be allocated at a later time 10. A few hours later another segment construction allocates the segment and causes file system corruption This can be prevented by simply reordering the statements. If nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments the freed segments are marked as dirty and cannot be allocated any more. Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-11Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs bugfixes from Ben Myers: "Here we have a bugfix for an off-by-one in the remote attribute verifier that results in a forced shutdown which you can hit with v5 superblock by creating a 64k xattr, and a fix for a missing destroy_work_on_stack() in the allocation worker. It's a bit late, but they are both fairly straightforward" * tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs: xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() xfs: fix off-by-one error in xfs_attr3_rmt_verify
2014-01-10xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()Chuansheng Liu
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to call destroy_work_on_stack() which frees the debug object to pair with INIT_WORK_ONSTACK(). Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 6f96b3063cdd473c68664a190524ed966ac0cd92)
2014-01-10xfs: fix off-by-one error in xfs_attr3_rmt_verifyJie Liu
With CRC check is enabled, if trying to set an attributes value just equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote attr write verification procedure failure, which would yield the back trace like below: <snip> XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c <snip> Call Trace: [<ffffffff816f0042>] dump_stack+0x45/0x56 [<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs] [<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs] [<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs] [<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs] [<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs] [<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs] [<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs] [<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460 [<ffffffff81097230>] ? wake_up_state+0x20/0x20 [<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs] [<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs] [<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs] [<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs] [<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs] [<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs] [<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs] [<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs] [<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs] [<ffffffff811df9b2>] generic_setxattr+0x62/0x80 [<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0 [<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10 [<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0 [<ffffffff811e054e>] setxattr+0x12e/0x1c0 [<ffffffff811c6e82>] ? final_putname+0x22/0x50 [<ffffffff811c708b>] ? putname+0x2b/0x40 [<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90 [<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0 [<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0 [<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0 [<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f Tests: setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile This patch fix it to check the remote EA size is greater than the XATTR_SIZE_MAX rather than more than or equal to it, because it's valid if the specified EA value size is equal to the limitation as per VFS setxattr interface. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 85dd0707f0cad26d60f2dc574d17a5ab948d10f7)
2014-01-07Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfix from Ted Ts'o: "Fix a regression introduced in v3.13-rc6" * tag 'ext4_for_linus_stable' of http://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix bigalloc regression
2014-01-06ext4: fix bigalloc regressionEric Whitney
Commit f5a44db5d2 introduced a regression on filesystems created with the bigalloc feature (cluster size > blocksize). It causes xfstests generic/006 and /013 to fail with an unexpected JBD2 failure and transaction abort that leaves the test file system in a read only state. Other xfstests run on bigalloc file systems are likely to fail as well. The cause is the accidental use of a cluster mask where a cluster offset was needed in ext4_ext_map_blocks(). Signed-off-by: Eric Whitney <enwlinux@gmail.com>
2014-01-02Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge patches from Andrew Morton: "Ten fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c drivers/dma/ioat/dma.c: check DMA mapping error in ioat_dma_self_test() mm/memory-failure.c: transfer page count from head page to tail page after split thp MAINTAINERS: set up proper record for Xilinx Zynq mm: remove bogus warning in copy_huge_pmd() memcg: fix memcg_size() calculation mm: fix use-after-free in sys_remap_file_pages mm: munlock: fix deadlock in __munlock_pagevec() mm: munlock: fix a bug where THP tail page is encountered
2014-01-02epoll: do not take the nested ep->mtx on EPOLL_CTL_DELJason Baron
The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock. That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple topologies"). The acquistion of the ep->mtx for the destination 'ep' was added such that a concurrent EPOLL_CTL_ADD operation would see the correct state of the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links') However, by simply not acquiring the lock, we do not serialize behind the ep->mtx from the add path, and thus may perform a full path check when if we had waited a little longer it may not have been necessary. However, this is a transient state, and performing the full loop checking in this case is not harmful. The important point is that we wouldn't miss doing the full loop checking when required, since EPOLL_CTL_ADD always locks any 'ep's that its operating upon. The reason we don't need to do lock ordering in the add path, is that we are already are holding the global 'epmutex' whenever we do the double lock. Further, the original posting of this patch, which was tested for the intended performance gains, did not perform this additional locking. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Eric Wong <normalperson@yhbt.net> Cc: Nelson Elhage <nelhage@nelhage.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02Merge tag 'gfs2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes Pull GFS2 fixes from Steven Whitehouse: "Here is a set of small fixes for GFS2. There is a fix to drop s_umount which is copied in from the core vfs, two patches relate to a hard to hit "use after free" and memory leak. Two patches related to using DIO and buffered I/O on the same file to ensure correct operation in relation to glock state changes. The final patch adds an RCU read lock to ensure correct locking on an error path" * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes: GFS2: Fix unsafe dereference in dump_holder() GFS2: Wait for async DIO in glock state changes GFS2: Fix incorrect invalidation for DIO/buffered I/O GFS2: Fix slab memory leak in gfs2_bufdata GFS2: Fix use-after-free race when calling gfs2_remove_from_ail GFS2: don't hold s_umount over blkdev_put
2014-01-02GFS2: Fix unsafe dereference in dump_holder()Tetsuo Handa
GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that RCU read lock is held when using task_struct returned from pid_task(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-27cifs: set FILE_CREATEDShirish Pargaonkar
Set FILE_CREATED on O_CREAT|O_EXCL. cifs code didn't change during commit 116cc0225381415b96551f725455d067f63a76a0 Kernel bugzilla 66251 Signed-off-by: Shirish Pargaonkar <spargaonkar@suse.com> Acked-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2013-12-27cifs: We do not drop reference to tlink in CIFSCheckMFSymlink()Sachin Prabhu
When we obtain tcon from cifs_sb, we use cifs_sb_tlink() to first obtain tlink which also grabs a reference to it. We do not drop this reference to tlink once we are done with the call. The patch fixes this issue by instead passing tcon as a parameter and avoids having to obtain a reference to the tlink. A lookup for the tcon is already made in the calling functions and this way we avoid having to re-run the lookup. This is also consistent with the argument list for other similar calls for M-F symlinks. We should also return an ENOSYS when we do not find a protocol specific function to lookup the MF Symlink data. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>
2013-12-27Add missing end of line termination to some cifs messagesSteve French
Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Gregor Beck <gbeck@sernet.de> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2013-12-26Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A collection of bug fixes destined for stable and some printk cleanups and a patch so that instead of BUG'ing we use the ext4_error() framework to mark the file system is corrupted" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add explicit casts when masking cluster sizes ext4: fix deadlock when writing in ENOSPC conditions jbd2: rename obsoleted msg JBD->JBD2 jbd2: revise KERN_EMERG error messages jbd2: don't BUG but return ENOSPC if a handle runs out of space ext4: Do not reserve clusters when fs doesn't support extents ext4: fix del_timer() misuse for ->s_err_report ext4: check for overlapping extents in ext4_valid_extent_entries() ext4: fix use-after-free in ext4_mb_new_blocks ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
2013-12-23Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 fix from Jan Kara: "One simple fix of oops in ext2 which was recently hit by Christoph" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
2013-12-22Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds
Pull AIO leak fixes from Ben LaHaise: "I've put these two patches plus Linus's change through a round of tests, and it passes millions of iterations of the aio numa migratepage test, as well as a number of repetitions of a few simple read and write tests. The first patch fixes the memory leak Kent introduced, while the second patch makes aio_migratepage() much more paranoid and robust" * git://git.kvack.org/~bcrl/aio-next: aio/migratepages: make aio migrate pages sane aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
2013-12-22aio: clean up and fix aio_setup_ring page mappingLinus Torvalds
Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages migration") the aio ring setup code has used a special per-ring backing inode for the page allocations, rather than just using random anonymous pages. However, rather than remembering the pages as it allocated them, it would allocate the pages, insert them into the file mapping (dirty, so that they couldn't be free'd), and then forget about them. And then to look them up again, it would mmap the mapping, and then use "get_user_pages()" to get back an array of the pages we just created. Now, not only is that incredibly inefficient, it also leaked all the pages if the mmap failed (which could happen due to excessive number of mappings, for example). So clean it all up, making it much more straightforward. Also remove some left-overs of the previous (broken) mm_populate() usage that was removed in commit d6c355c7dabc ("aio: fix race in ring buffer page lookup introduced by page migration support") but left the pointless and now misleading MAP_POPULATE flag around. Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-21aio/migratepages: make aio migrate pages saneBenjamin LaHaise
The arbitrary restriction on page counts offered by the core migrate_page_move_mapping() code results in rather suspicious looking fiddling with page reference counts in the aio_migratepage() operation. To fix this, make migrate_page_move_mapping() take an extra_count parameter that allows aio to tell the code about its own reference count on the page being migrated. While cleaning up aio_migratepage(), make it validate that the old page being passed in is actually what aio_migratepage() expects to prevent misbehaviour in the case of races. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-21aio: fix kioctx leak introduced by "aio: Fix a trinity splat"Benjamin LaHaise
e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference counting to correct a bug trinity found. Unfortunately, the change lead to kioctxes being leaked because there was no final reference count to put. Add that reference count back in to fix things. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
2013-12-20Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs bugfixes from Ben Myers: "This contains fixes for some asserts related to project quotas, a memory leak, a hang when disabling group or project quotas before disabling user quotas, Dave's email address, several fixes for the alignment of file allocation to stripe unit/width geometry, a fix for an assertion with xfs_zero_remaining_bytes, and the behavior of metadata writeback in the face of IO errors. Details: - fix memory leak in xfs_dir2_node_removename - fix quota assertion in xfs_setattr_size - fix quota assertions in xfs_qm_vop_create_dqattach - fix for hang when disabling group and project quotas before disabling user quotas - fix Dave Chinner's email address in MAINTAINERS - fix for file allocation alignment - fix for assertion in xfs_buf_stale by removing xfsbdstrat - fix for alignment with swalloc mount option - fix for "retry forever" semantics on IO errors" * tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs: xfs: abort metadata writeback on permanent errors xfs: swalloc doesn't align allocations properly xfs: remove xfsbdstrat error xfs: align initial file allocations correctly MAINTAINERS: fix incorrect mail address of XFS maintainer xfs: fix infinite loop by detaching the group/project hints from user dquot xfs: fix assertion failure at xfs_setattr_nonsize xfs: fix false assertion at xfs_qm_vop_create_dqattach xfs: fix memory leak in xfs_dir2_node_removename
2013-12-20pstore: Don't allow high traffic options on fragile devicesLuck, Tony
Some pstore backing devices use on board flash as persistent storage. These have limited numbers of write cycles so it is a poor idea to use them from high frequency operations. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20ext4: add explicit casts when masking cluster sizesTheodore Ts'o
The missing casts can cause the high 64-bits of the physical blocks to be lost. Set up new macros which allows us to make sure the right thing happen, even if at some point we end up supporting larger logical block numbers. Thanks to the Emese Revfy and the PaX security team for reporting this issue. Reported-by: PaX Team <pageexec@freemail.hu> Reported-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-12-20GFS2: Wait for async DIO in glock state changesSteven Whitehouse
We need to wait for any outstanding DIO to complete in a couple of situations. Firstly, in case we are changing out of deferred mode (in inode_go_sync) where GLF_DIRTY will not be set. That call could be prefixed with a test for gl_state == LM_ST_DEFERRED but it doesn't seem worth it bearing in mind that the test for outstanding DIO is very quick anyway, in the usual case that there is none. The second case is in inode_go_lock which will catch the cases where we have a cached EX lock, but where we grant deferred locks against it so that there is no glock state transistion. We only need to wait if the state is not deferred, since DIO is valid anyway in that state. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20GFS2: Fix incorrect invalidation for DIO/buffered I/OSteven Whitehouse
In patch 209806aba9d540dde3db0a5ce72307f85f33468f we allowed local deferred locks to be granted against a cached exclusive lock. That opened up a corner case which this patch now fixes. The solution to the problem is to check whether we have cached pages each time we do direct I/O and if so to unmap, flush and invalidate those pages. Since the glock state machine normally does that for us, mostly the code will be a no-op. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-18Merge tag 'driver-core-3.13-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue in sysfs that has been reported" * tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: sysfs: give different locking key to regular and bin files
2013-12-18ext4: fix deadlock when writing in ENOSPC conditionsJan Kara
Akira-san has been reporting rare deadlocks of his machine when running xfstests test 269 on ext4 filesystem. The problem turned out to be in ext4_da_reserve_metadata() and ext4_da_reserve_space() which called ext4_should_retry_alloc() while holding i_data_sem. Since ext4_should_retry_alloc() can force a transaction commit, this is a lock ordering violation and leads to deadlocks. Fix the problem by just removing the retry loops. These functions should just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that function must take care of retrying after dropping all necessary locks. Reported-and-tested-by: Akira Fujita <a-fujita@rs.jp.nec.com> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "One of these is fixing a regression from the d_flags file type patch that went into -rc1 that broke instantiation of inodes and dentries (we were doing dentries first). The other is just an off-by-one corner case" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: Avoid data inconsistency due to d-cache aliasing in readpage() ceph: initialize inode before instantiating dentry
2013-12-17xfs: abort metadata writeback on permanent errorsDave Chinner
If we are doing aysnc writeback of metadata, we can get write errors but have nobody to report them to. At the moment, we simply attempt to reissue the write from io completion in the hope that it's a transient error. When it's not a transient error, the buffer is stuck forever in this loop, and we cannot break out of it. Eventually, unmount will hang because the AIL cannot be emptied and everything goes downhill from them. To solve this problem, only retry the write IO once before aborting it. We don't throw the buffer away because some transient errors can last minutes (e.g. FC path failover) or even hours (thin provisioned devices that have run out of backing space) before they go away. Hence we really want to keep trying until we can't try any more. Because the buffer was not cleaned, however, it does not get removed from the AIL and hence the next pass across the AIL will start IO on it again. As such, we still get the "retry forever" semantics that we currently have, but we allow other access to the buffer in the mean time. Meanwhile the filesystem can continue to modify the buffer and relog it, so the IO errors won't hang the log or the filesystem. Now when we are pushing the AIL, we can see all these "permanent IO error" buffers and we can issue a warning about failures before we retry the IO. We can also catch these buffers when unmounting an issue a corruption warning, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: swalloc doesn't align allocations properlyDave Chinner
When swalloc is specified as a mount option, allocations are supposed to be aligned to the stripe width rather than the stripe unit of the underlying filesystem. However, it does not do this. What the implementation does is round up the allocation size to a stripe width, hence ensuring that all allocations span a full stripe width. It does not, however, ensure that that allocation is aligned to a stripe width, and hence the allocations can span multiple underlying stripes and so still see RMW cycles for things like direct IO on MD RAID. So, if the swalloc mount option is set, change the allocation alignment in xfs_bmap_btalloc() to use the stripe width rather than the stripe unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: remove xfsbdstrat errorChristoph Hellwig
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that handles the case of a shut down filesystem. Most of the users have private, uncached buffers that can just be freed in this case, but the complex error handling in xfs_bioerror_relse messes up the case when it's called without a locked buffer. Remove xfsbdstrat and opencode the error handling in the callers. All but one can simply return an error and don't need to deal with buffer state, and the one caller that cares about the buffer state could do with a major cleanup as well, but we'll defer that to later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: align initial file allocations correctlyDave Chinner
The function xfs_bmap_isaeof() is used to indicate that an allocation is occurring at or past the end of file, and as such should be aligned to the underlying storage geometry if possible. Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the behaviour of this function for empty files - it turned off allocation alignment for this case accidentally. Hence large initial allocations from direct IO are not getting correctly aligned to the underlying geometry, and that is cause write performance to drop in alignment sensitive configurations. Fix it by considering allocation into empty files as requiring aligned allocation again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)
2013-12-17xfs: fix infinite loop by detaching the group/project hints from user dquotJie Liu
xfs_quota(8) will hang up if trying to turn group/project quota off before the user quota is off, this could be 100% reproduced by: # mount -ouquota,gquota /dev/sda7 /xfs # mkdir /xfs/test # xfs_quota -xc 'off -g' /xfs <-- hangs up # echo w > /proc/sysrq-trigger # dmesg SysRq : Show Blocked State task PC stack pid father xfs_quota D 0000000000000000 0 27574 2551 0x00000000 [snip] Call Trace: [<ffffffff81aaa21d>] schedule+0xad/0xc0 [<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0 [<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0 [<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0 [<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs] [<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30 [<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs] [<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs] [<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs] [<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs] [<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs] [<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs] [<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs] [<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs] [<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs] [<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0 [<ffffffff812fba4b>] ? iput+0x5b/0x90 [<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0 [<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b It's fine if we turn user quota off at first, then turn off other kind of quotas if they are enabled since the group/project dquot refcount is decreased to zero once the user quota if off. Otherwise, those dquots refcount is non-zero due to the user dquot might refer to them as hint(s). Hence, above operation cause an infinite loop at xfs_qm_dquot_walk() while trying to purge dquot cache. This problem has been around since Linux 3.4, it was introduced by: [ b84a3a9675 xfs: remove the per-filesystem list of dquots ] Originally we will release the group dquot pointers because the user dquots maybe carrying around as a hint via xfs_qm_detach_gdquots(). However, with above change, there is no such work to be done before purging group/project dquot cache. In order to solve this problem, this patch introduces a special routine xfs_qm_dqpurge_hints(), and it would release the group/project dquot pointers the user dquots maybe carrying around as a hint, and then it will proceed to purge the user dquot cache if requested. Cc: stable@vger.kernel.org Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit df8052e7dae00bde6f21b40b6e3e1099770f3afc)
2013-12-17xfs: fix assertion failure at xfs_setattr_nonsizeJie Liu
For CRC enabled v5 super block, change a file's ownership can simply trigger an ASSERT failure at xfs_setattr_nonsize() if both group and project quota are enabled, i.e, [ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621 [ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable] [ 305.383939] Call Trace: [ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs] [ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs] [ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350 [ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110 [ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110 [ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30 [ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs] This fix adjust the assertion to check if the super block support both quota inodes or not. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 5a01dd54f4a7fb513062070c5acef20d13cad980)
2013-12-17xfs: fix false assertion at xfs_qm_vop_create_dqattachJie Liu
After the previous fix, there still has another ASSERT failure if turning off any type of quota while fsstress is running at the same time. Backtrace in this case: [ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118 [ 50.867924] ------------[ cut here ]------------ ... <snip> [ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable] [ 50.867999] invalid opcode: 0000 [#1] SMP [ 50.869407] Call Trace: [ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs] [ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs] [ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs] [ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs] [ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130 [ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0 [ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20 [ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip> [ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs] [ 50.870050] RSP <ffff88002941fd60> [ 50.879251] ---[ end trace c93a2b342341c65b ]--- We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(), however the assertion itself is not right IMHO. While performing quota off, we firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee that the desired quota is still active. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 37eb9706ebf5b99d14c6086cdeef2c2f73f9c9fb)
2013-12-17xfs: fix memory leak in xfs_dir2_node_removenameMark Tinguely
Fix the leak of kernel memory in xfs_dir2_node_removename() when xfs_dir2_leafn_remove() returns an error code. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit ef701600fd26cace9d513ee174688a2b83832126)
2013-12-13GFS2: Fix slab memory leak in gfs2_bufdataBob Peterson
This patch fixes a slab memory leak that sometimes can occur for files with a very short lifespan. The problem occurs when a dinode is deleted before it has gotten to the journal properly. In the leak scenario, the bd object is pinned for journal committment (queued to the metadata buffers queue: sd_log_le_buf) but is subsequently unpinned and dequeued before it finds its way to the ail or the revoke queue. In this rare circumstance, the bd object needs to be freed from slab memory, or it is forgotten. We have to be very careful how we do it, though, because multiple processes can call gfs2_remove_from_journal. In order to avoid double-frees, only the process that does the unpinning is allowed to free the bd. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13GFS2: Fix use-after-free race when calling gfs2_remove_from_ailBob Peterson
Function gfs2_remove_from_ail drops the reference on the bh via brelse. This patch fixes a race condition whereby bh is deferenced after the brelse when setting bd->bd_blkno = bh->b_blocknr; Under certain rare circumstances, bh might be gone or reused, and bd->bd_blkno is set to whatever that memory happens to be, which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails the test "bd->bd_blkno >= blkno" which causes it to never be freed. The end result is that the bd is never freed from the bufdata cache, which results in this error: slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13GFS2: don't hold s_umount over blkdev_putSteven Whitehouse
This is a GFS2 version of Tejun's patch: 4f331f01b9c43bf001d3ffee578a97a1e0633eac vfs: don't hold s_umount over close_bdev_exclusive() call In this case its blkdev_put itself that is the issue and this patch uses the same solution of dropping and retaking s_umount. Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-14writeback: Fix data corruption on NFSJan Kara
Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added a condition to skip clean inode. However this is wrong in WB_SYNC_ALL mode because there we also want to wait for outstanding writeback on possibly clean inode. This was causing occasional data corruption issues on NFS because it uses sync_inode() to make sure all outstanding writes are flushed to the server before truncating the inode and with sync_inode() returning prematurely file was sometimes extended back by an outstanding write after it was truncated. So modify the test to also check for pages under writeback in WB_SYNC_ALL mode. CC: stable@vger.kernel.org # >= 3.5 Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93 Reported-and-tested-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-13ceph: Avoid data inconsistency due to d-cache aliasing in readpage()Li Wang
If the length of data to be read in readpage() is exactly PAGE_CACHE_SIZE, the original code does not flush d-cache for data consistency after finishing reading. This patches fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>