Age | Commit message (Collapse) | Author |
|
commit 22e757a49cf010703fcb9c9b4ef793248c39b0c2 upstream.
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:
1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)
where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.
The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?
Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say
what?
OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.
This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.
Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5fd364fee81a7888af806e42ed8a91c845894f2d upstream.
When running xfs/305, I noticed that quotacheck was flushing dquot
buffers that did not have the xfs_dquot_buf_ops verifiers attached:
XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00 DQ....e.........
ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
Call Trace:
[<ffffffff81cf1cca>] dump_stack+0x45/0x56
[<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0
[<ffffffff810db520>] ? wake_up_state+0x20/0x20
[<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0
[<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0
[<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0
[<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220
[<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90
[<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90
[<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0
[<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0
[<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0
[<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340
[<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0
[<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170
[<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20
[<ffffffff811ca8c9>] mount_fs+0x39/0x1b0
[<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120
[<ffffffff811e757e>] do_mount+0x23e/0xad0
[<ffffffff8117abde>] ? __get_free_pages+0xe/0x50
[<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150
[<ffffffff811e8103>] SyS_mount+0x83/0xc0
[<ffffffff81cfd40b>] tracesys+0xdd/0xe2
This was caused by dquot buffer readahead not attaching a verifier
structure to the buffer when readahead was issued, resulting in the
followup read of the buffer finding a valid buffer and so not
attaching new verifiers to the buffer as part of the read.
Also, when a verifier failure occurs, we then read the buffer
without verifiers. Attach the verifiers manually after this read so
that if the buffer is then written it will be verified that the
corruption has been repaired.
Further, when flushing a dquot we don't ask for a verifier when
reading in the dquot buffer the dquot belongs to. Most of the time
this isn't an issue because the buffer is still cached, but when it
is not cached it will result in writing the dquot buffer without
having the verfier attached.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 67dc288c21064b31a98a53dc64f6b9714b819fd6 upstream.
Crash testing of CRC enabled filesystems has resulted in a number of
reports of bad CRCs being detected after the filesystem was mounted.
Errors such as the following were being seen:
XFS (sdb3): Mounting V5 Filesystem
XFS (sdb3): Starting recovery (logdev: internal)
XFS (sdb3): Metadata CRC error detected at xfs_agf_read_verify+0x5a/0x100 [xfs], block 0x1
XFS (sdb3): Unmount and run xfs_repair
XFS (sdb3): First 64 bytes of corrupted metadata buffer:
ffff880136ffd600: 58 41 47 46 00 00 00 01 00 00 00 00 00 0f aa 40 XAGF...........@
ffff880136ffd610: 00 02 6d 53 00 02 77 f8 00 00 00 00 00 00 00 01 ..mS..w.........
ffff880136ffd620: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 03 ................
ffff880136ffd630: 00 00 00 04 00 08 81 d0 00 08 81 a7 00 00 00 00 ................
XFS (sdb3): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 74 numblks 1
The errors were typically being seen in AGF, AGI and their related
btree block buffers some time after log recovery had run. Often it
wasn't until later subsequent mounts that the problem was
discovered. The common symptom was a buffer with the correct
contents, but a CRC and an LSN that matched an older version of the
contents.
Some debug added to _xfs_buf_ioapply() indicated that buffers were
being written without verifiers attached to them from log recovery,
and Jan Kara isolated the cause to log recovery readahead an dit's
interactions with buffers that had a more recent LSN on disk than
the transaction being recovered. In this case, the buffer did not
get a verifier attached, and os when the second phase of log
recovery ran and recovered EFIs and unlinked inodes, the buffers
were modified and written without the verifier running. Hence they
had up to date contents, but stale LSNs and CRCs.
Fix it by attaching verifiers to buffers we skip due to future LSN
values so they don't escape into the buffer cache without the
correct verifier attached.
This patch is based on analysis and a patch from Jan Kara.
Reported-by: Jan Kara <jack@suse.cz>
Reported-by: Fanael Linithien <fanael4@gmail.com>
Reported-by: Grozdan <neutrino8@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 12a5b5294cb1896e9a3c9fca8ff5a7e3def4e8c6 upstream.
Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
by C), copy_tree() of A would make a copy of D' shadow the the copy of
C, not the other way around.
It's easy to fix, fortunately - just make sure that mount follows
the one that shadows it in mnt_child as well as in mnt_hash, and when
copy_tree() decides to attach a new mount, check if the last child
it has added to the same parent should be shadowing the new one.
And if it should, just use the same logics commit_tree() has - put the
new mount into the hash and children lists right after the one that
should shadow it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 81b6b06197606b4bef4e427a197aeb808e8d89e1 upstream.
We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints. However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed. Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now). Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not. So they'll still be around when we get to namespace_unlock().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 88b368f27a094277143d8ecd5a056116f6a41520 upstream.
The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ffbc6f0ead47fa5a1dc9642b0331cb75c20a640e upstream.
Since March 2009 the kernel has treated the state that if no
MS_..ATIME flags are passed then the kernel defaults to relatime.
Defaulting to relatime instead of the existing atime state during a
remount is silly, and causes problems in practice for people who don't
specify any MS_...ATIME flags and to get the default filesystem atime
setting. Those users may encounter a permission error because the
default atime setting does not work.
A default that does not work and causes permission problems is
ridiculous, so preserve the existing value to have a default
atime setting that is always guaranteed to work.
Using the default atime setting in this way is particularly
interesting for applications built to run in restricted userspace
environments without /proc mounted, as the existing atime mount
options of a filesystem can not be read from /proc/mounts.
In practice this fixes user space that uses the default atime
setting on remount that are broken by the permission checks
keeping less privileged users from changing more privileged users
atime settings.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9566d6742852c527bf5af38af5cbb878dad75705 upstream.
While invesgiating the issue where in "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.
In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked. These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changable if set by a more privileged user.
The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev may not be clearable by a less privielged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.
The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled. Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.
The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.
Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 07b645589dcda8b7a5249e096fece2a67556f0f4 upstream.
There are no races as locked mount flags are guaranteed to never change.
Moving the test into do_remount makes it more visible, and ensures all
filesystem remounts pass the MNT_LOCK_READONLY permission check. This
second case is not an issue today as filesystem remounts are guarded
by capable(CAP_DAC_ADMIN) and thus will always fail in less privileged
mount namespaces, but it could become an issue in the future.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a6138db815df5ee542d848318e5dae681590fccd upstream.
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.
Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.
This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
plus fixing it a different way...
We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits. This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.
Consider a root application which drops all capabilities from ALL 4
capability sets. We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.
The BSET gets cleared differently. Instead it is cleared one bit at a
time. The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read. So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.
So the 'parent' will look something like:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffc000000000
All of this 'should' be fine. Given that these are undefined bits that
aren't supposed to have anything to do with permissions. But they do...
So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel). We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets. If that root task calls execve()
the child task will pick up all caps not blocked by the bset. The bset
however does not block bits higher than CAP_LAST_CAP. So now the child
task has bits in eff which are not in the parent. These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.
The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits! So now we set durring commit creds that
the child is not dumpable. Given it is 'more priv' than its parent. It
also means the parent cannot ptrace the child and other stupidity.
The solution here:
1) stop hiding capability bits in status
This makes debugging easier!
2) stop giving any task undefined capability bits. it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
This fixes the cap_issubset() tests and resulting fallout (which
made the init task in a docker container untraceable among other
things)
3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
This lets 'setcap all+pe /bin/bash; /bin/bash' run
Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aee7af356e151494d5014f57b33460b162f181b5 upstream.
In the presence of delegations, we can no longer assume that the
state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open
stateid share mode, and so we need to calculate the initial value
for calldata->arg.fmode using the state->flags.
Reported-by: James Drews <drews@engr.wisc.edu>
Fixes: 88069f77e1ac5 (NFSv41: Fix a potential state leakage when...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f87d928f6d98644d39809a013a22f981d39017cf upstream.
When creating a new object on the NFS server, we should not be sending
posix setacl requests unless the preceding posix_acl_create returned a
non-trivial acl. Doing so, causes Solaris servers in particular to
return an EINVAL.
Fixes: 013cdf1088d72 (nfs: use generic posix ACL infrastructure,,,)
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3c45ddf823d679a820adddd53b52c6699c9a05ac upstream.
The current code always selects XPRT_TRANSPORT_BC_TCP for the back
channel, even when the forward channel was not TCP (eg, RDMA). When
a 4.1 mount is attempted with RDMA, the server panics in the TCP BC
code when trying to send CB_NULL.
Instead, construct the transport protocol number from the forward
channel transport or'd with XPRT_TRANSPORT_BC. Transports that do
not support bi-directional RPC will not have registered a "BC"
transport, causing create_backchannel_client() to fail immediately.
Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7a9e75a185e6b3a3860e6a26fb6e88691fc2c9d9 upstream.
There was a check for result being not NULL. But get_acl() may return
NULL, or ERR_PTR, or actual pointer.
The purpose of the function where current change is done is to "list
ACLs only when they are available", so any error condition of get_acl()
mustn't be elevated, and returning 0 there is still valid.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81111
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 74adf83f5d77 (nfs: only show Posix ACLs in listxattr if actually...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d9499a95716db0d4bc9b67e88fd162133e7d6b08 upstream.
A memory allocation failure could cause nfsd_startup_generic to fail, in
which case nfsd_users wouldn't be incorrectly left elevated.
After nfsd restarts nfsd_startup_generic will then succeed without doing
anything--the first consequence is likely nfs4_start_net finding a bad
laundry_wq and crashing.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 4539f14981ce "nfsd: replace boolean nfsd_up flag by users counter"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit db9ee220361de03ee86388f9ea5e529eaad5323c upstream.
It turns out that there are some serious problems with the on-disk
format of journal checksum v2. The foremost is that the function to
calculate descriptor tag size returns sizes that are too big. This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.
Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.
Add a few function helpers so we don't have to open-code quite so
many pieces.
Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 022eaa7517017efe4f6538750c2b59a804dc7df7 upstream.
When recovering the journal, don't fall into an infinite loop if we
encounter a corrupt journal block. Instead, just skip the block and
return an error, which fails the mount and thus forces the user to run
a full filesystem fsck.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6603120e96eae9a5d6228681ae55c7fdc998d1bb upstream.
In case of delalloc block i_disksize may be less than i_size. So we
have to update i_disksize each time we allocated and submitted some
blocks beyond i_disksize. We weren't doing this on the error paths,
so fix this.
testcase: xfstest generic/019
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 38c1c2e44bacb37efd68b90b3f70386a8ee370ee upstream.
The crash is
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
This is in fact a regression.
It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ce62003f690dff38d3164a632ec69efa15c32cbf upstream.
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f7ff6d7832c6be13e8c95598884dbc40ad69fb7 upstream.
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 27b9a8122ff71a8cadfbffb9c4f0694300464f3b upstream.
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.
The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.
For example, consider the following scenario, where we have:
sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
Leaf N:
slot = 0 slot = btrfs_header_nritems() - 1
|-------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
|-------------------------------------------------------------------|
Leaf N + 1:
slot = 0 slot = btrfs_header_nritems() - 1
|--------------------------------------------------------------------|
| [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
|--------------------------------------------------------------------|
Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:
slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
|----------------------------------------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] |
|----------------------------------------------------------------------------------------------------|
And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:
tmp = min((sums->len - total_bytes) >> blocksize_bits,
(next_offset - file_key.offset) >> blocksize_bits) =
min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
and
ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.
So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:
Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4eb1f66dce6c4dc28dd90a7ffbe6b2b1cb08aa4e upstream.
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
BUG: unable to handle kernel NULL pointer dereference at 00000004
IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
*pde = 00000000
Oops: 0000 [#1] SMP
CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1
Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
task: f1478130 ti: f147c000 task.ti: f147c000
EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
Stack:
00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
Call Trace:
[<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
[<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
[<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
[<c025e38b>] process_one_work+0x11b/0x390
[<c025eea1>] worker_thread+0x101/0x340
[<c026432b>] kthread+0x9b/0xb0
[<c0712a71>] ret_from_kernel_thread+0x21/0x30
[<c0264290>] kthread_create_on_node+0x110/0x110
This indicates a NULL corruption in prefs_delayed list. The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.
ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux. The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux. That is, the function overwrites
64bit value on 32bit pointer. This caused a NULL in the adjacent
variable, in this case, prefs_delayed.
Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64. There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly. But, it's
safer than explicit cast to u64, anyway.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c99d1e6e83b06744c75d9f5e491ed495a7086b7b upstream.
If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():
BUG_ON(last >= (sb->s_blocksize << 3));
Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
is zero.
Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().
Google-Bug-Id: 16844242
Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 485d44022a152c0254dd63445fdb81c4194cbf0e upstream.
[ I'm currently running my tests on it now, and so far, after a few
hours it has yet to blow up. I'll run it for 24 hours which it never
succeeded in the past. ]
The tracing code has a way to make directories within the debugfs file
system as well as deleting them using mkdir/rmdir in the instance
directory. This is very limited in functionality, such as there is
no renames, and the parent directory "instance" can not be modified.
The tracing code creates the instance directory from the debugfs code
and then replaces the dentry->d_inode->i_op with its own to allow
for mkdir/rmdir to work.
When these are called, the d_entry and inode locks need to be released
to call the instance creation and deletion code. That code has its own
accounting and locking to serialize everything to prevent multiple
users from causing harm. As the parent "instance" directory can not
be modified this simplifies things.
I created a stress test that creates several threads that randomly
creates and deletes directories thousands of times a second. The code
stood up to this test and I submitted it a while ago.
Recently I added a new test that adds readers to the mix. While the
instance directories were being added and deleted, readers would read
from these directories and even enable tracing within them. This test
was able to trigger a bug:
general protection fault: 0000 [#1] PREEMPT SMP
Modules linked in: ...
CPU: 3 PID: 17789 Comm: rmdir Tainted: G W 3.15.0-rc2-test+ #41
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
task: ffff88003786ca60 ti: ffff880077018000 task.ti: ffff880077018000
RIP: 0010:[<ffffffff811ed5eb>] [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
RSP: 0018:ffff880077019df8 EFLAGS: 00010246
RAX: 0000000000000002 RBX: ffff88006f0fe490 RCX: 0000000000000000
RDX: dead000000100058 RSI: 0000000000000246 RDI: ffff88003786d454
RBP: ffff88006f0fe640 R08: 0000000000000628 R09: 0000000000000000
R10: 0000000000000628 R11: ffff8800795110a0 R12: ffff88006f0fe640
R13: ffff88006f0fe640 R14: ffffffff81817d0b R15: ffffffff818188b7
FS: 00007ff13ae24700(0000) GS:ffff88007d580000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003054ec7be0 CR3: 0000000076d51000 CR4: 00000000000007e0
Stack:
ffff88007a41ebe0 dead000000100058 00000000fffffffe ffff88006f0fe640
0000000000000000 ffff88006f0fe678 ffff88007a41ebe0 ffff88003793a000
00000000fffffffe ffffffff810bde82 ffff88006f0fe640 ffff88007a41eb28
Call Trace:
[<ffffffff810bde82>] ? instance_rmdir+0x15b/0x1de
[<ffffffff81132e2d>] ? vfs_rmdir+0x80/0xd3
[<ffffffff81132f51>] ? do_rmdir+0xd1/0x139
[<ffffffff8124ad9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff814fea62>] ? system_call_fastpath+0x16/0x1b
Code: fe ff ff 48 8d 75 30 48 89 df e8 c9 fd ff ff 85 c0 75 13 48 c7 c6 b8 cc d2 81 48 c7 c7 b0 cc d2 81 e8 8c 7a f5 ff 48 8b 54 24 08 <48> 8b 82 a8 00 00 00 48 89 d3 48 2d a8 00 00 00 48 89 44 24 08
RIP [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
RSP <ffff880077019df8>
It took a while, but every time it triggered, it was always in the
same place:
list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {
Where the child->d_u.d_child seemed to be corrupted. I added lots of
trace_printk()s to see what was wrong, and sure enough, it was always
the child's d_u.d_child field. I looked around to see what touches
it and noticed that in __dentry_kill() which calls dentry_free():
static void dentry_free(struct dentry *dentry)
{
/* if dentry was never visible to RCU, immediate free is OK */
if (!(dentry->d_flags & DCACHE_RCUACCESS))
__d_free(&dentry->d_u.d_rcu);
else
call_rcu(&dentry->d_u.d_rcu, __d_free);
}
I also noticed that __dentry_kill() unlinks the child->d_u.child
under the parent->d_lock spin_lock.
Looking back at the loop in debugfs_remove_recursive() it never takes the
parent->d_lock to do the list walk. Adding more tracing, I was able to
prove this was the issue:
ftrace-t-15385 1.... 246662024us : dentry_kill <ffffffff81138b91>: free ffff88006d573600
rmdir-15409 2.... 246662024us : debugfs_remove_recursive <ffffffff811ec7e5>: child=ffff88006d573600 next=dead000000100058
The dentry_kill freed ffff88006d573600 just as the remove recursive was walking
it.
In order to fix this, the list walk needs to be modified a bit to take
the parent->d_lock. The safe version is no longer necessary, as every
time we remove a child, the parent->d_lock must be released and the
list walk must start over. Each time a child is removed, even though it
may still be on the list, it should be skipped by the first check
in the loop:
if (!debugfs_positive(child))
continue;
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 86f0afd463215fc3e58020493482faa4ac3a4d69 upstream.
If there is a failure while allocating the preallocation structure, a
number of blocks can end up getting marked in the in-memory buddy
bitmap, and then not getting released. This can result in the
following corruption getting reported by the kernel:
EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126,
12793 clusters in bitmap, 12729 in gd
In that case, we need to release the blocks using mb_free_blocks().
Tested: fs smoke test; also demonstrated that with injected errors,
the file system is no longer getting corrupted
Google-Bug-Id: 16657874
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 410dd3cf4c9b36f27ed4542ee18b1af5e68645a4 upstream.
We did not check relocated directory in any way when processing Rock
Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL
entry pointing to another CL entry leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).
Fix the problem by not allowing CL entry to point to a directory entry
with CL entry (such use makes no good sense anyway) and by checking
whether CL entry doesn't point to itself.
Reported-by: Chris Evans <cevans@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 110dc24ad2ae4e9b94b08632fe1eb2fcdff83045 upstream.
The addition of direct formatting of log items into the CIL
linear buffer added alignment restrictions that the start of each
vector needed to be 64 bit aligned. Hence padding was added in
xlog_finish_iovec() to round up the vector length to ensure the next
vector started with the correct alignment.
This adds a small number of bytes to the size of
the linear buffer that is otherwise unused. The issue is that we
then use the linear buffer size to determine the log space used by
the log item, and this includes the unused space. Hence when we
account for space used by the log item, it's more than is actually
written into the iclogs, and hence we slowly leak this space.
This results on log hangs when reserving space, with threads getting
stuck with these stack traces:
Call Trace:
[<ffffffff81d15989>] schedule+0x29/0x70
[<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0
[<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140
[<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220
[<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310
.....
The 4 bytes is significant. Brain Foster did all the hard work in
tracking down a reproducable leak to inode chunk allocation (it went
away with the ikeep mount option). His rough numbers were that
creating 50,000 inodes leaked 11 log blocks. This turns out to be
roughly 800 inode chunks or 1600 inode cluster buffers. That
works out at roughly 4 bytes per cluster buffer logged, and at that
I started looking for a 4 byte leak in the buffer logging code.
What I found was that a struct xfs_buf_log_format structure for an
inode cluster buffer is 28 bytes in length. This gets rounded up to
32 bytes, but the vector length remains 28 bytes. Hence the CIL
ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
for that vector rather than 28 bytes which are written into the log.
The fix for this problem is to separately track the bytes used by
the log vectors in the item and use that instead of the buffer
length when accounting for the log space that will be used by the
formatted log item.
Again, thanks to Brian Foster for doing all the hard work and long
hours to isolate this leak and make finding the bug relatively
simple.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Bill <billstuff2001@sbcglobal.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 295dc39d941dc2ae53d5c170365af4c9d5c16212 upstream.
Currently umount on symlink blocks following umount:
/vz is separate mount
# ls /vz/ -al | grep test
drwxr-xr-x. 2 root root 4096 Jul 19 01:14 testdir
lrwxrwxrwx. 1 root root 11 Jul 19 01:16 testlink -> /vz/testdir
# umount -l /vz/testlink
umount: /vz/testlink: not mounted (expected)
# lsof /vz
# umount /vz
umount: /vz: device is busy. (unexpected)
In this case mountpoint_last() gets an extra refcount on path->mnt
Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aed8adb7688d5744cb484226820163af31d2499a upstream.
Commit 079148b919d0 ("coredump: factor out the setting of PF_DUMPCORE")
cleaned up the setting of PF_DUMPCORE by removing it from all the
linux_binfmt->core_dump() and moving it to zap_threads().But this ended
up clearing all the previously set flags. This causes issues during
core generation when tsk->flags is checked again (eg. for PF_USED_MATH
to dump floating point registers). Fix this.
Signed-off-by: Silesh C V <svellattu@mvista.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 74adf83f5d7720925499b4938f930591f947b660 upstream.
The big ACL switched nfs to use generic_listxattr, which calls all existing
->list handlers. Add a custom .listxattr implementation that only lists
the ACLs if they actually are present on the given inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Philippe Troin <phil@fifi.org>
Tested-by: Philippe Troin <phil@fifi.org>
Fixes: 013cdf1088d7 (nfs: use generic posix ACL infrastructure ...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 263782c1c95bbddbb022dc092fd89a36bb8d5577 upstream.
As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context. While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU. This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.
Many thanks to Robert and folks for testing and tracking this down.
Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d68aab6b8f572406aa93b45ef6483934dd3b54a6 upstream.
Commit 1ab6c4997e04 (fs: convert fs shrinkers to new scan/count API)
accidentally removed locking from quota shrinker. Fix it -
dqcache_shrink_scan() should use dq_list_lock to protect the
scan on free_dquots list.
Fixes: 1ab6c4997e04a00c50c6d786c2f046adc0d1f5de
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 154210ccb3a871e631bf39fdeb7a8731d98af87b upstream.
The following test case demonstrates the bug:
sh# mount -t glusterfs localhost:meta-test /mnt/one
sh# mount -t glusterfs localhost:meta-test /mnt/two
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
bash: /mnt/one/file: Stale file handle
sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file
On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now, would be for FUSE to ignore the entry-timeout and
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.
Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 233a01fa9c4c7c41238537e8db8434667ff28a2f upstream.
If the number in "user_id=N" or "group_id=N" mount options was larger than
INT_MAX then fuse returned EINVAL.
Fix this to handle all valid uid/gid values.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 126b9d4365b110c157bc4cbc32540dfa66c9c85a upstream.
As suggested by checkpatch.pl, use time_before64() instead of direct
comparison of jiffies64 values.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3f1f9b851311a76226140b55b1ea22111234a7c2 upstream.
This fixes the following lockdep complaint:
[ INFO: possible circular locking dependency detected ]
3.16.0-rc2-mm1+ #7 Tainted: G O
-------------------------------------------------------
kworker/u24:0/4356 is trying to acquire lock:
(&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
but task is already holding lock:
(&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180
which lock already depends on the new lock.
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ei->i_es_lock);
lock(&(&sbi->s_es_lru_lock)->rlock);
lock(&ei->i_es_lock);
lock(&(&sbi->s_es_lru_lock)->rlock);
*** DEADLOCK ***
6 locks held by kworker/u24:0/4356:
#0: ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
#2: (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
#3: (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
#4: (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
#5: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180
stack backtrace:
CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G O 3.16.0-rc2-mm1+ #7
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-253:0)
ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
Call Trace:
[<ffffffff815df0bb>] dump_stack+0x4e/0x68
[<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
[<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
[<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
[<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
[<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
[<ffffffff810a8707>] lock_acquire+0x87/0x120
[<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
[<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
[<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
[<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
[<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
[<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
[<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
[<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
[<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
...
Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream.
The mount manpage says of the max_batch_time option,
This optimization can be turned off entirely
by setting max_batch_time to 0.
But the code doesn't do that. So fix the code to do
that.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 94d4c066a4ff170a2671b1a9b153febbf36796f6 upstream.
We are spending a lot of time explaining to users what this error
means. Let's try to improve the message to avoid this problem.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ae0f78de2c43b6fadd007c231a352b13b5be8ed2 upstream.
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 61c219f5814277ecb71d64cb30297028d6665979 upstream.
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc. Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e02ba72aabfade4c9cd6e3263e9b57bf890ad25c upstream.
deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().
man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.
Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:
initialize io_context
io_submit(read to buffer)
io_destroy()
// context is destroyed so we can free the resources
free(buffers);
// if the buffer is allocated by some other user he'll be surprised
// to learn that the buffer still filled by an outstanding operation
// from the destroyed io_context
The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.
If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.
Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
do not see the race condition anymore.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 76f47128f9b33af1e96819746550d789054c9664 upstream.
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.
The vfs, on the other hand, wants null-terminated data.
The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.
The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.
But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.
Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.
In the NFSv2/v3 cases this actually appears to be safe:
- nfs3svc_decode_symlinkargs explicitly null-terminates the data
(after first checking its length and copying it to a new
page).
- NFSv2 limits symlinks to 1k. The buffer holding the rpc
request is always at least a page, and the link data (and
previous fields) have maximum lengths that prevent the request
from reaching the end of a page.
In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.
The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though. It should
really either do the copy itself every time or just require a
null-terminated string.
Reported-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a93cd4cf86466caa49cfe64607bea7f0bde3f916 upstream.
Hole punching code for files with indirect blocks wrongly computed
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause a data loss. For example:
fallocate -n -p 10240000 4096
will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096. Fix the calculation.
Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c5c7b8ddfbf8cb3b2291e515a34ab1b8982f5a2d upstream.
Error recovery in ext4_alloc_branch() calls ext4_forget() even for
buffer corresponding to indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from cleanup in ext4_ind_map_blocks()) leading to buffer use
count misaccounting. Eventually (but often much later because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer
Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk
The fix is easy - don't forget buffer we did not allocate. Also add an
explanatory comment because the indexing at ext4_alloc_branch() is
somewhat subtle.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
option
commit ce36d9ab3bab06b7b5522f5c8b68fac231b76ffb upstream.
When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ?
via the Unicode Windows to POSIX remap range) empty paths
(eg when we open "" to query the root of the SMB3 directory on mount) were not
null terminated so we sent garbarge as a path name on empty paths which caused
SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified. mapchars is
particularly important since Unix Extensions for SMB3 are not supported (yet)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a1d0b84c308d7fdfb67eb76498116a6c2fdda507 upstream.
commit d81b8a40e2ece0a9ab57b1fe1798e291e75bf8fc
("CIFS: Cleanup cifs open codepath")
changed disposition to FILE_OPEN.
Signed-off-by: Björn Baumbach <bb@sernet.de>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Shilovsky <piastry@etersoft.ru>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 22e7478ddbcb670e33fab72d0bbe7c394c3a2c84 upstream.
Prior to commit 0e4f6a791b1e (Fix reiserfs_file_release()), reiserfs
truncates serialized on i_mutex. They mostly still do, with the exception
of reiserfs_file_release. That blocks out other writers via the tailpack
mutex and the inode openers counter adjusted in reiserfs_file_open.
However, NFS will call reiserfs_setattr without having called ->open, so
we end up with a race when nfs is calling ->setattr while another
process is releasing the file. Ultimately, it triggers the
BUG_ON(inode->i_size != new_file_size) check in maybe_indirect_to_direct.
The solution is to pull the lock into reiserfs_setattr to encompass the
truncate_setsize call as well.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 556b8883cfac3d3203557e161ea8005f8b5479b2 upstream.
Commit daba542 ("xfs: skip verification on initial "guess"
superblock read") dropped the use of a verifier for the initial
superblock read so we can probe the sector size of the filesystem
stored in the superblock. It, however, now fails to validate that
what was read initially is actually an XFS superblock and hence will
fail the sector size check and return ENOSYS.
This causes probe-based mounts to fail because it expects XFS to
return EINVAL when it doesn't recognise the superblock format.
Reported-by: Plamen Petrov <plamen.sisi@gmail.com>
Tested-by: Plamen Petrov <plamen.sisi@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|