aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-01-12NFS4: fix compile warnings in nfs4proc.cPeng Tao
compile in nfs-for-3.3 branch shows following warnings. Fix it here. fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’: fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 4 has type ‘size_t’ fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 6 has type ‘size_t’ Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-12nfs: check for integer overflow in decode_devicenotify_args()Dan Carpenter
On 32 bit, if n is too large then "n * sizeof(*args->devs)" could overflow and args->devs would be smaller than expected. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-12NFS: cleanup endian type in decode_ds_addr()Dan Carpenter
port is supposed to be a __be16 here. The existing code should work fine, but this is a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-12NFS: add an endian notationDan Carpenter
This function returns a big endian value. The implementation in fs/nfs/callback_proc.c is declared with "__be32" but the .h file uses "unsigned" instead. It makes sparse complain: fs/nfs/callback_proc.c:232:8: error: symbol 'nfs4_callback_layoutrecall' redeclared with different type (originally declared at fs/nfs/callback.h:165) - different base types Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-01-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: FUSE: Notifying the kernel of deletion. fuse: support ioctl on directories fuse: Use kcalloc instead of kzalloc to allocate array fuse: llseek optimize SEEK_CUR and SEEK_SET
2012-01-12ceph: ensure prealloc_blob is in place when removing xattrAlex Elder
In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes are marked dirty, all attributes recorded in its rb_tree index are formatted into a "blob" buffer. The target buffer is recorded in ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and be of sufficient size to hold the attributes. The extended attributes are marked dirty in two cases: when a new attribute is added to the inode; or when one is removed. In the former case work is done to ensure the prealloc_blob buffer is properly set up, but in the latter it is not. Change the logic in ceph_removexattr() so it matches what is done in ceph_setxattr(). Note that this is done in a way that keeps the two blocks of code nearly identical, in anticipation of a subsequent patch that encapsulates some of this logic into one or more helper routines. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12ceph: enable/disable dentry complete flags via mount optionSage Weil
Enable/disable use of the dentry dir 'complete' flag via a mount option. This lets the admin control whether ceph uses the dcache to satisfy negative lookups or readdir when it has the entire directory contents in its cache. This is purely a performance optimization; correctness is guaranteed whether it is enabled or not. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12vfs: export symbol d_find_any_alias()Sage Weil
Ceph needs this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12fs: remove unneeded plug in mpage_readpages()Namjae Jeon
The block plug in mpage_readpages() duplicates the one in read_pages(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11ceph: always initialize the dentry in open_root_dentry()Alex Elder
When open_root_dentry() gets a dentry via d_obtain_alias() it does not get initialized. If the dentry obtained came from the cache, this is OK. But if not, the result is an improperly initialized dentry. To fix this, call ceph_init_dentry() regardless of which path produced the dentry. That function returns immediately for a dentry that is already initialized, it is safe to use either way. (Credit to Sage, who suggested this fix.) Signed-off-by: Alex Elder <aelder@sgi.com>
2012-01-11UBIFS: fix debugging messagesArtem Bityutskiy
Patch 56e46742e846e4de167dde0e1e1071ace1c882a5 broke UBIFS debugging messages: before that commit when UBIFS debugging was enabled, users saw few useful debugging messages after mount. However, that patch turned 'dbg_msg()' into 'pr_debug()', so to enable the debugging messages users have to enable them first via /sys/kernel/debug/dynamic_debug/control, which is very impractical. This commit makes 'dbg_msg()' to use 'printk()' instead of 'pr_debug()', just as it was before the breakage. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: stable@kernel.org [3.0+]
2012-01-11UBIFS: make debugging messages light againArtem Bityutskiy
We switch to dynamic debugging in commit 56e46742e846e4de167dde0e1e1071ace1c882a5 but did not take into account that now we do not control anymore whether a specific message is enabled or not. So now we lock the "dbg_lock" and release it in every debugging macro, which make them not so light-weight. This commit removes the "dbg_lock" protection from the debugging macros to fix the issue. The downside is that now our DBGKEY() stuff is broken, but this is not critical at all and will be fixed later. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: stable@kernel.org [3.0+]
2012-01-11GFS2: Fix nlink setting on inode creationSteven Whitehouse
Since the nlink count will be 0, we need to use set_nlink rather than inc_nlink in order to avoid triggering the inc_nlink warning which was added recently. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-11GFS2: fail mount if journal recovery failsDavid Teigland
If the first mounter fails to recover one of the journals during mount, the mount should fail. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-11GFS2: let spectator mount do read only recoveryDavid Teigland
Previously, a spectator mount would not even attempt to do journal recovery for a failed node. This meant that if all mounted nodes were spectators, everyone would be stuck after a node failed, all waiting for recovery to be performed. This is unnecessary since the failed node had a clean journal. Instead, allow a spectator mount to do a partial "read only" recovery, which means it will check if the failed journal is clean, and if so, report a successful recovery. If the failed journal is not clean, it reports that journal recovery failed. This makes it work the same as a read only mount on a read only block device. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-11GFS2: Fix a use-after-free that coverity spottedBob Peterson
In function gfs2_inplace_release it was trying to unlock a gfs2_holder structure associated with a reservation, after said reservation was freed. The problem is that the statements have the wrong order. This patch corrects the order so that the reservation is freed after the gfs2_holder is unlocked. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-11GFS2: dlm based recovery coordinationDavid Teigland
This new method of managing recovery is an alternative to the previous approach of using the userland gfs_controld. - use dlm slot numbers to assign journal id's - use dlm recovery callbacks to initiate journal recovery - use a dlm lock to determine the first node to mount fs - use a dlm lock to track journals that need recovery Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-01-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: autofs4: deal with autofs4_write/autofs4_write races autofs4: catatonic_mode vs. notify_daemon race autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race hfsplus: creation of hidden dir on mount can fail block_dev: Suppress bdev_cache_init() kmemleak warninig fix shrink_dcache_parent() livelock coda: switch coda_cnode_make() to sane API as well, clean coda_lookup() coda: deal correctly with allocation failure from coda_cnode_makectl() securityfs: fix object creation races
2012-01-11autofs4: deal with autofs4_write/autofs4_write racesAl Viro
Just serialize the actual writing of packets into pipe on a new mutex, independent from everything else in the locking hierarchy. As soon as something has started feeding a piece of packet into the pipe to daemon, we *want* everything else about to try the same to wait until we are done. Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-11autofs4: catatonic_mode vs. notify_daemon raceAl Viro
we need to hold ->wq_mutex while we are forming the packet to send, lest we have autofs4_catatonic_mode() setting wq->name.name to NULL just as autofs4_notify_daemon() decides to memcpy() from it... We do have check for catatonic mode immediately after that (under ->wq_mutex, as it ought to be) and packet won't be actually sent, but it'll be too late for us if we oops on that memcpy() from NULL... Fix is obvious - just extend the area covered by ->wq_mutex over that switch and check whether it's catatonic *before* doing anything else. Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-11autofs4: autofs4_wait() vs. autofs4_catatonic_mode() raceAl Viro
We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex for good, or we might end up with wq inserted into queue after autofs4_catatonic_mode() had done its thing. It will stick there forever, since there won't be anything to clear its ->name.name. A bit of a complication: validate_request() drops and regains ->wq_mutex. It actually ends up the most convenient place to stick the check into... Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-11Btrfs: fix possible deadlock when opening a seed deviceLi Zefan
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex, but when we mount a filesystem which has backing seed devices, we have this lock chain: open_ctree() lock(chunk_mutex); read_chunk_tree(); read_one_dev(); open_seed_devices(); lock(uuid_mutex); and then we hit a lockdep splat. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: update global block_rsv when creating a new block groupLi Zefan
A bug was triggered while using seed device: # mkfs.btrfs /dev/loop1 # btrfstune -S 1 /dev/loop1 # mount -o /dev/loop1 /mnt # btrfs dev add /dev/loop2 /mnt btrfs: block rsv returned -28 ------------[ cut here ]------------ WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]() ... Call Trace: ... [<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs] [<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs] [<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs] [<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs] [<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs] [<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs] [<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs] [<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs] [<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs] [<c04f55f7>] do_vfs_ioctl+0x496/0x4cc [<c04f5660>] sys_ioctl+0x33/0x4f [<c07b9edf>] sysenter_do_call+0x12/0x38 ---[ end trace 906adac595facc7d ]--- Since seed device is readonly, there's no usable space in the filesystem. Afterwards we add a sprout device to it, and the kernel creates a METADATA block group and a SYSTEM block group where comes free space we can reserve, but we still get revervation failure because the global block_rsv hasn't been updated accordingly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: rewrite btrfs_trim_block_group()Li Zefan
There are various bugs in block group trimming: - It may trim from offset smaller than user-specified offset. - It may trim beyond user-specified range. - It may leak free space for extents smaller than specified minlen. - It may truncate the last trimmed extent thus leak free space. - With mixed extents+bitmaps, some extents may not be trimmed. - With mixed extents+bitmaps, some bitmaps may not be trimmed (even none will be trimmed). Even for those trimmed, not all the free space in the bitmaps will be trimmed. I rewrite btrfs_trim_block_group() and break it into two functions. One is to trim extents only, and the other is to trim bitmaps only. Before patching: # fstrim -v /mnt/ /mnt/: 1496465408 bytes were trimmed After patching: # fstrim -v /mnt/ /mnt/: 2193768448 bytes were trimmed And this matches the total free space: # btrfs fi df /mnt Data: total=3.58GB, used=1.79GB System, DUP: total=8.00MB, used=4.00KB System: total=4.00MB, used=0.00 Metadata, DUP: total=205.12MB, used=97.14MB Metadata: total=8.00MB, used=0.00 Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: simplfy calculation of stripe length for discard operationLi Zefan
For btrfs raid, while discarding a range of space, we'll need to know the start offset and length to discard for each device, and it's done in btrfs_map_block(). However the calculation is a bit complex for raid0 and raid10, so I reimplement it based on a fact that: dev1 dev2 dev3 (raid0) ----------------------------------- s0 s3 s6 s1 s4 s7 s2 s5 Each device has (total_stripes / nr_dev) stripes, or plus one. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: don't pre-allocate btrfs bioLi Zefan
We pre-allocate a btrfs bio with fixed size, and then may re-allocate memory if we find stripes are bigger than the fixed size. But this pre-allocation is not necessary. Also we don't have to calcuate the stripe number twice. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: don't pass a trans handle unnecessarily in volumes.cLi Zefan
Some functions never use the transaction handle passed to them. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: reserve metadata space in btrfs_ioctl_setflags()Li Zefan
Check and reserve space for btrfs_update_inode(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: remove BUG_ON()s in btrfs_ioctl_setflags()Li Zefan
We can recover from errors and return -errno to user space. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: check the return value of io_ctl_init()Li Zefan
It can return -ENOMEM. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: avoid possible NULL deref in io_ctl_drop_pages()Li Zefan
If we run into some failure path in io_ctl_prepare_pages(), io_ctl->pages[] array may have some NULL pointers. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Btrfs: add pinned extents to on-disk free space cache correctlyLi Zefan
I got this while running xfstests: [24256.836098] block group 317849600 has an wrong amount of free space [24256.836100] btrfs: failed to load free space cache for block group 317849600 We should clamp the extent returned by find_first_extent_bit(), so the start of the extent won't smaller than the start of the block group. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11Merge branch 'for-linus' of ↵Li Zefan
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs into for-linus
2012-01-10Merge branch 'writeback-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c writeback: balanced_rate cannot exceed write bandwidth writeback: do strict bdi dirty_exceeded writeback: avoid tiny dirty poll intervals writeback: max, min and target dirty pause time writeback: dirty ratelimit - think time compensation btrfs: fix dirtied pages accounting on sub-page writes writeback: fix dirtied pages accounting on redirty writeback: fix dirtied pages accounting on sub-page writes writeback: charge leaked page dirties to active tasks writeback: Include all dirty inodes in background writeback
2012-01-10Merge branch 'akpm' (aka "Andrew's patch-bomb")Linus Torvalds
Andrew elucidates: - First installmeant of MM. We have a HUGE number of MM patches this time. It's crazy. - MAINTAINERS updates - backlight updates - leds - checkpatch updates - misc ELF stuff - rtc updates - reiserfs - procfs - some misc other bits * akpm: (124 commits) user namespace: make signal.c respect user namespaces workqueue: make alloc_workqueue() take printf fmt and args for name procfs: add hidepid= and gid= mount options procfs: parse mount options procfs: introduce the /proc/<pid>/map_files/ directory procfs: make proc_get_link to use dentry instead of inode signal: add block_sigmask() for adding sigmask to current->blocked sparc: make SA_NOMASK a synonym of SA_NODEFER reiserfs: don't lock root inode searching reiserfs: don't lock journal_init() reiserfs: delay reiserfs lock until journal initialization reiserfs: delete comments referring to the BKL drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030 drivers/rtc/: remove redundant spi driver bus initialization drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static rtc: convert drivers/rtc/* to use module_platform_driver() drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc() drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler ...
2012-01-10procfs: add hidepid= and gid= mount optionsVasiliy Kulikov
Add support for mount options to restrict access to /proc/PID/ directories. The default backward-compatible "relaxed" behaviour is left untouched. The first mount option is called "hidepid" and its value defines how much info about processes we want to be available for non-owners: hidepid=0 (default) means the old behavior - anybody may read all world-readable /proc/PID/* files. hidepid=1 means users may not access any /proc/<pid>/ directories, but their own. Sensitive files like cmdline, sched*, status are now protected against other users. As permission checking done in proc_pid_permission() and files' permissions are left untouched, programs expecting specific files' modes are not confused. hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other users. It doesn't mean that it hides whether a process exists (it can be learned by other means, e.g. by kill -0 $PID), but it hides process' euid and egid. It compicates intruder's task of gathering info about running processes, whether some daemon runs with elevated privileges, whether another user runs some sensitive program, whether other users run any program at all, etc. gid=XXX defines a group that will be able to gather all processes' info (as in hidepid=0 mode). This group should be used instead of putting nonroot user in sudoers file or something. However, untrusted users (like daemons, etc.) which are not supposed to monitor the tasks in the whole system should not be added to the group. hidepid=1 or higher is designed to restrict access to procfs files, which might reveal some sensitive private information like precise keystrokes timings: http://www.openwall.com/lists/oss-security/2011/11/05/3 hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and conky gracefully handle EPERM/ENOENT and behave as if the current user is the only user running processes. pstree shows the process subtree which contains "pstree" process. Note: the patch doesn't deal with setuid/setgid issues of keeping preopened descriptors of procfs files (like https://lkml.org/lkml/2011/2/7/368). We rely on that the leaked information like the scheduling counters of setuid apps doesn't threaten anybody's privacy - only the user started the setuid program may read the counters. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg KH <greg@kroah.com> Cc: Theodore Tso <tytso@MIT.EDU> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10procfs: parse mount optionsVasiliy Kulikov
Add support for procfs mount options. Actual mount options are coming in the next patches. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Greg KH <greg@kroah.com> Cc: Theodore Tso <tytso@MIT.EDU> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10procfs: introduce the /proc/<pid>/map_files/ directoryPavel Emelyanov
This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end", the target is the file. Opening a symlink results in a file that point exactly to the same inode as them vma's one. For example the ls -l of some arbitrary /proc/<pid>/map_files/ | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so This *helps* checkpointing process in three ways: 1. When dumping a task mappings we do know exact file that is mapped by particular region. We do this by opening /proc/$pid/map_files/$address symlink the way we do with file descriptors. 2. This also helps in determining which anonymous shared mappings are shared with each other by comparing the inodes of them. 3. When restoring a set of processes in case two of them has a mapping shared, we map the memory by the 1st one and then open its /proc/$pid/map_files/$address file and map it by the 2nd task. Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable re-reading and reparsing for this text file which slows down restore procedure significantly. Also as being pointed in (3) it is a way easier to use top level shared mapping in children as /proc/$pid/map_files/$address when needed. [akpm@linux-foundation.org: coding-style fixes] [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE] Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10procfs: make proc_get_link to use dentry instead of inodeCyrill Gorcunov
Prepare the ground for the next "map_files" patch which needs a name of a link file to analyse. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10reiserfs: don't lock root inode searchingFrederic Weisbecker
Nothing requires that we lock the filesystem until the root inode is provided. Also iget5_locked() triggers a warning because we are holding the filesystem lock while allocating the inode, which result in a lockdep suspicion that we have a lock inversion against the reclaim path: [ 1986.896979] ================================= [ 1986.896990] [ INFO: inconsistent lock state ] [ 1986.896997] 3.1.1-main #8 [ 1986.897001] --------------------------------- [ 1986.897007] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [ 1986.897016] kswapd0/16 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1986.897023] (&REISERFS_SB(s)->lock){+.+.?.}, at: [<c01f8bd4>] reiserfs_write_lock+0x20/0x2a [ 1986.897044] {RECLAIM_FS-ON-W} state was registered at: [ 1986.897050] [<c014a5b9>] mark_held_locks+0xae/0xd0 [ 1986.897060] [<c014aab3>] lockdep_trace_alloc+0x7d/0x91 [ 1986.897068] [<c0190ee0>] kmem_cache_alloc+0x1a/0x93 [ 1986.897078] [<c01e7728>] reiserfs_alloc_inode+0x13/0x3d [ 1986.897088] [<c01a5b06>] alloc_inode+0x14/0x5f [ 1986.897097] [<c01a5cb9>] iget5_locked+0x62/0x13a [ 1986.897106] [<c01e99e0>] reiserfs_fill_super+0x410/0x8b9 [ 1986.897114] [<c01953da>] mount_bdev+0x10b/0x159 [ 1986.897123] [<c01e764d>] get_super_block+0x10/0x12 [ 1986.897131] [<c0195b38>] mount_fs+0x59/0x12d [ 1986.897138] [<c01a80d1>] vfs_kern_mount+0x45/0x7a [ 1986.897147] [<c01a83e3>] do_kern_mount+0x2f/0xb0 [ 1986.897155] [<c01a987a>] do_mount+0x5c2/0x612 [ 1986.897163] [<c01a9a72>] sys_mount+0x61/0x8f [ 1986.897170] [<c044060c>] sysenter_do_call+0x12/0x32 [ 1986.897181] irq event stamp: 7509691 [ 1986.897186] hardirqs last enabled at (7509691): [<c0190f34>] kmem_cache_alloc+0x6e/0x93 [ 1986.897197] hardirqs last disabled at (7509690): [<c0190eea>] kmem_cache_alloc+0x24/0x93 [ 1986.897209] softirqs last enabled at (7508896): [<c01294bd>] __do_softirq+0xee/0xfd [ 1986.897222] softirqs last disabled at (7508859): [<c01030ed>] do_softirq+0x50/0x9d [ 1986.897234] [ 1986.897235] other info that might help us debug this: [ 1986.897242] Possible unsafe locking scenario: [ 1986.897244] [ 1986.897250] CPU0 [ 1986.897254] ---- [ 1986.897257] lock(&REISERFS_SB(s)->lock); [ 1986.897265] <Interrupt> [ 1986.897269] lock(&REISERFS_SB(s)->lock); [ 1986.897276] [ 1986.897277] *** DEADLOCK *** [ 1986.897278] [ 1986.897286] no locks held by kswapd0/16. [ 1986.897291] [ 1986.897292] stack backtrace: [ 1986.897299] Pid: 16, comm: kswapd0 Not tainted 3.1.1-main #8 [ 1986.897306] Call Trace: [ 1986.897314] [<c0439e76>] ? printk+0xf/0x11 [ 1986.897324] [<c01482d1>] print_usage_bug+0x20e/0x21a [ 1986.897332] [<c01479b8>] ? print_irq_inversion_bug+0x172/0x172 [ 1986.897341] [<c014855c>] mark_lock+0x27f/0x483 [ 1986.897349] [<c0148d88>] __lock_acquire+0x628/0x1472 [ 1986.897358] [<c0149fae>] lock_acquire+0x47/0x5e [ 1986.897366] [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a [ 1986.897384] [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a [ 1986.897397] [<c043b5ef>] mutex_lock_nested+0x35/0x26f [ 1986.897409] [<c01f8bd4>] ? reiserfs_write_lock+0x20/0x2a [ 1986.897421] [<c01f8bd4>] reiserfs_write_lock+0x20/0x2a [ 1986.897433] [<c01e2edd>] map_block_for_writepage+0xc9/0x590 [ 1986.897448] [<c01b1706>] ? create_empty_buffers+0x33/0x8f [ 1986.897461] [<c0121124>] ? get_parent_ip+0xb/0x31 [ 1986.897472] [<c043ef7f>] ? sub_preempt_count+0x81/0x8e [ 1986.897485] [<c043cae0>] ? _raw_spin_unlock+0x27/0x3d [ 1986.897496] [<c0121124>] ? get_parent_ip+0xb/0x31 [ 1986.897508] [<c01e355d>] reiserfs_writepage+0x1b9/0x3e7 [ 1986.897521] [<c0173b40>] ? clear_page_dirty_for_io+0xcb/0xde [ 1986.897533] [<c014a6e3>] ? trace_hardirqs_on_caller+0x108/0x138 [ 1986.897546] [<c014a71e>] ? trace_hardirqs_on+0xb/0xd [ 1986.897559] [<c0177b38>] shrink_page_list+0x34f/0x5e2 [ 1986.897572] [<c01780a7>] shrink_inactive_list+0x172/0x22c [ 1986.897585] [<c0178464>] shrink_zone+0x303/0x3b1 [ 1986.897597] [<c043cae0>] ? _raw_spin_unlock+0x27/0x3d [ 1986.897611] [<c01788c9>] kswapd+0x3b7/0x5f2 The deadlock shouldn't happen since we are doing that allocation in the mount path, the filesystem is not available for any reclaim. Still the warning is annoying. To solve this, acquire the lock later only where we need it, right before calling reiserfs_read_locked_inode() that wants to lock to walk the tree. Reported-by: Knut Petersen <Knut_Petersen@t-online.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10reiserfs: don't lock journal_init()Frederic Weisbecker
journal_init() doesn't need the lock since no operation on the filesystem is involved there. journal_read() and get_list_bitmap() have yet to be reviewed carefully though before removing the lock there. Just keep the it around these two calls for safety. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10reiserfs: delay reiserfs lock until journal initializationFrederic Weisbecker
In the mount path, transactions that are made before journal initialization don't involve the filesystem. We can delay the reiserfs lock until we play with the journal. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10reiserfs: delete comments referring to the BKLDavidlohr Bueso
Signed-off-by: Davidlohr Bueso <dave@gnu.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10fs: binfmt_elf: create Kconfig variable for PIE randomizationDavid Daney
Randomization of PIE load address is hard coded in binfmt_elf.c for X86 and ARM. Create a new Kconfig variable (CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead. Thus architecture specific policy is pushed out of the generic binfmt_elf.c and into the architecture Kconfig files. X86 and ARM Kconfigs are modified to select the new variable so there is no change in behavior. A follow on patch will select it for MIPS too. Signed-off-by: David Daney <david.daney@cavium.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10tracepoint: add tracepoints for debugging oom_score_adjKAMEZAWA Hiroyuki
oom_score_adj is used for guarding processes from OOM-Killer. One of problem is that it's inherited at fork(). When a daemon set oom_score_adj and make children, it's hard to know where the value is set. This patch adds some tracepoints useful for debugging. This patch adds 3 trace points. - creating new task - renaming a task (exec) - set oom_score_adj To debug, users need to enable some trace pointer. Maybe filtering is useful as # EVENT=/sys/kernel/debug/tracing/events/task/ # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter # echo "oom_score_adj != 0" > $EVENT/task_rename/filter # echo 1 > $EVENT/enable # EVENT=/sys/kernel/debug/tracing/events/oom/ # echo 1 > $EVENT/enable output will be like this. # grep oom /sys/kernel/debug/tracing/trace bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000 bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000 ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000 bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000 grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10btrfs: pass __GFP_WRITE for buffered write page allocationsJohannes Weiner
Tell the page allocator that pages allocated for a buffered write are expected to become dirty soon. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10mm: account reaped page cache on inode cache pruningKonstantin Khlebnikov
Inode cache pruning indirectly reclaims page-cache by invalidating mapping pages. Let's account them into reclaim-state to notice this progress in memory reclaimer. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Ext4 commits for 3.3 merge window * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits) ext4: fix undefined behavior in ext4_fill_flex_info() ext4: make more symbols static ext4: make local symbol ext4_initxattrs static jbd2: fix hung processes in jbd2_journal_lock_updates() ext4: reserve new feature flag codepoints ext4: Report max_batch_time option correctly ext4: add missing ext4_resize_end on error paths ext4: let ext4_group_add() use common code ext4: let ext4_group_extend() use common code ext4: add new online resize interface ext4: add a new function which adds a flex group to a fs ext4: add a new function which allocates bitmaps and inode tables ext4: pass verify_reserved_gdb() the number of group decriptors ext4: add a function which updates the super block during online resizing ext4: add a function which sets up a block group descriptors of a flex bg ext4: add a function which sets up group blocks of a flex bg ext4: add a structure which will be used by 64bit-resize interface ext4: add a function which adds a new group descriptors to a fs ext4: add a function which extends a group without checking parameters ext4: use proper little-endian bitops ...
2012-01-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: iattr_valid flags are kernel internal flags map them to 9p values. fs/9p: We should not allocate a new inode when creating hardlines. fs/9p: v9fs_stat2inode should update suid/sgid bits. 9p: Reduce object size with CONFIG_NET_9P_DEBUG fs/9p: check schedule_timeout_interruptible return value Fix up trivial conflicts in fs/9p/{vfs_inode.c,vfs_inode_dotl.c} due to debug messages having changed to use p9_debug() on one hand, and the changes for umode_t on the other.
2012-01-10Merge branch 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
* 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Change the default setting of the nfs4_disable_idmapping parameter NFSv4: Save the owner/group name string when doing open NFS: Remove pNFS bloat from the generic write path pnfs-obj: Must return layout on IO error pnfs-obj: pNFS errors are communicated on iodata->pnfs_error NFS: Cache state owners after files are closed NFS: Clean up nfs4_find_state_owners_locked() NFSv4: include bitmap in nfsv4 get acl data nfs: fix a minor do_div portability issue NFSv4.1: cleanup comment and debug printk NFSv4.1: change nfs4_free_slot parameters for dynamic slots NFSv4.1: cleanup init and reset of session slot tables NFSv4.1: fix backchannel slotid off-by-one bug nfs: fix regression in handling of context= option in NFSv4 NFS - fix recent breakage to NFS error handling. NFS: Retry mounting NFSROOT SUNRPC: Clean up the RPCSEC_GSS service ticket requests