aboutsummaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2010-05-19xfs: update and factor xfs_trans_committed()Dave Chinner
The function header to xfs-trans_committed has long had this comment: * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item() To prepare for different methods of committing items, convert the code to use xfs_trans_next_item() and factor the code into smaller, more digestible chunks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: clean up xfs_trans_commit logic even moreChristoph Hellwig
> +shut_us_down: > + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0; > + if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) { > + xfs_trans_unreserve_and_mod_sb(tp); > + /* This whole area in _xfs_trans_commit is still a complete mess. So while touching this code, unravel this mess as well to make the whole flow of the function simpler and clearer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19xfs: split out iclog writing from xfs_trans_commit()Dave Chinner
Split the the part of xfs_trans_commit() that deals with writing the transaction into the iclog into a separate function. This isolates the physical commit process from the logical commit operation and makes it easier to insert different transaction commit paths without affecting the existing algorithm adversely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: fix reservation release commit flag in xfs_bmap_add_attrfork()Dave Chinner
xfs_bmap_add_attrfork() passes XFS_TRANS_PERM_LOG_RES to xfs_trans_commit() to indicate that the commit should release the permanent log reservation as part of the commit. This is wrong - the correct flag is XFS_TRANS_RELEASE_LOG_RES - and it is only by the chance that both these flags have the value of 0x4 that the code is doing the right thing. Fix it by changing to use the correct flag. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: remove stale parameter from ->iop_unpin methodDave Chinner
The staleness of a object being unpinned can be directly derived from the object itself - there is no need to extract it from the object then pass it as a parameter into IOP_UNPIN(). This means we can kill the XFS_LID_BUF_STALE flag - it is set, checked and cleared in the same places XFS_BLI_STALE flag in the xfs_buf_log_item so it is now redundant and hence safe to remove. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: Add inode pin counts to tracesDave Chinner
We don't record pin counts in inode events right now, and this makes it difficult to track down problems related to pinning inodes. Add the pin count to the inode trace class and add trace events for pinning and unpinning inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: factor log item initialisationDave Chinner
Each log item type does manual initialisation of the log item. Delayed logging introduces new fields that need initialisation, so factor all the open coded initialisation into a common function first. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-05-19xfs: add blockdev name to kthreadsJan Engelhardt
This allows to see in `ps` and similar tools which kthreads are allotted to which block device/filesystem, similar to what jbd2 does. As the process name is a fixed 16-char array, no extra space is needed in tasks. PID TTY STAT TIME COMMAND 2 ? S 0:00 [kthreadd] 197 ? S 0:00 \_ [jbd2/sda2-8] 198 ? S 0:00 \_ [ext4-dio-unwrit] 204 ? S 0:00 \_ [flush-8:0] 2647 ? S 0:00 \_ [xfs_mru_cache] 2648 ? S 0:00 \_ [xfslogd/0] 2649 ? S 0:00 \_ [xfsdatad/0] 2650 ? S 0:00 \_ [xfsconvertd/0] 2651 ? S 0:00 \_ [xfsbufd/ram0] 2652 ? S 0:00 \_ [xfsaild/ram0] 2653 ? S 0:00 \_ [xfssyncd/ram0] Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-05-19xfs: Fix integer overflow in fs/xfs/linux-2.6/xfs_ioctl*.cZhitong Wang
The am_hreq.opcount field in the xfs_attrmulti_by_handle() interface is not bounded correctly. The opcount is used to determine the size of the buffer required. The size is bounded, but can overflow and so the size checks may not be sufficient to catch invalid opcounts. Fix it by catching opcount values that would cause overflows before calculating the size. Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com> Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-04-29xfs: add a shrinker to background inode reclaimDave Chinner
On low memory boxes or those with highmem, kernel can OOM before the background reclaims inodes via xfssyncd. Add a shrinker to run inode reclaim so that it inode reclaim is expedited when memory is low. This is more complex than it needs to be because the VM folk don't want a context added to the shrinker infrastructure. Hence we need to add a global list of XFS mount structures so the shrinker can traverse them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-04-29Merge branch 'master' into for-2.6.35Jens Axboe
Conflicts: fs/block_dev.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28blkdev: generalize flags for blkdev_issue_fn functionsDmitry Monakhov
The patch just convert all blkdev_issue_xxx function to common set of flags. Wait/allocation semantics preserved. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-26xfs: more swap extent fixes for dynamic fork offsetsDave Chinner
A new xfsqa test (226) with a prototype xfs_fsr change to try to handle dynamic fork offsets better triggers an assertion failure where the inode data fork is in btree format, yet there is room in the inode for it to be in extent format. The two inodes look like: before: ino 0x101 (target), num_extents 11, Max in-fork extents 6, broot size 40, fork offset 96 before: ino 0x115 (temp), num_extents 5, Max in-fork extents 3, broot size 40, fork offset 56 after: ino 0x101 (target), num_extents 5, Max in-fork extents 6, broot size 40, fork offset 96 after: ino 0x115 (temp), num_extents 11, Max in-fork extents 3, broot size 40, fork offset 56 Basically the target inode ends up with 5 extents in btree format, but it had space for 6 extents in extent format, so ends up incorrect. Notably here the broot size is the same, and that is where the kernel code is going wrong - the btree root will fit, so it lets the swap go ahead. The check should not allow the swap to take place if the number of extents while in btree format is less than the number of extents that can fit in the inode in extent format. Adding that check will prevent this swap and corruption from occurring. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-04-16xfs: don't warn on EAGAIN in inode reclaimDave Chinner
Any inode reclaim flush that returns EAGAIN will result in the inode reclaim being attempted again later. There is no need to issue a warning into the logs about this situation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-04-16xfs: ensure that sync updates the log tail correctlyDave Chinner
Updates to the VFS layer removed an extra ->sync_fs call into the filesystem during the sync process (from the quota code). Unfortunately the sync code was unknowingly relying on this call to make sure metadata buffers were flushed via a xfs_buftarg_flush() call to move the tail of the log forward in memory before the final transactions of the sync process were issued. As a result, the old code would write a very recent log tail value to the log by the end of the sync process, and so a subsequent crash would leave nothing for log recovery to do. Hence in qa test 182, log recovery only replayed a small handle for inode fsync transactions in this case. However, with the removal of the extra ->sync_fs call, the log tail was now not moved forward with the inode fsync transactions near the end of the sync procese the first (and only) buftarg flush occurred after these transactions went to disk. The result is that log recovery now sees a large number of transactions for metadata that is already on disk. This usually isn't a problem, but when the transactions include inode chunk allocation, the inode create transactions and all subsequent changes are replayed as we cannt rely on what is on disk is valid. As a result, if the inode was written and contains unlogged changes, the unlogged changes are lost, thereby violating sync semantics. The fix is to always issue a transaction after the buftarg flush occurs is the log iѕ not idle or covered. This results in a dummy transaction being written that contains the up-to-date log tail value, which will be very recent. Indeed, it will be at least as recent as the old code would have left on disk, so log recovery will behave exactly as it used to in this situation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-16xfs: don't warn about page discards on shutdownDave Chinner
If we are doing a forced shutdown, we can get lots of noise about delalloc pages being discarded. This is happens by design during a forced shutdown, so don't spam the logs with these messages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16xfs: use scalable vmap APIAlex Elder
Re-apply a commit that had been reverted due to regressions that have since been fixed. From 95f8e302c04c0b0c6de35ab399a5551605eeb006 Mon Sep 17 00:00:00 2001 From: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:43:09 +1100 Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (db64fe02) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Only modifications here were a minor reformat, plus making the patch apply given the new use of xfs_buf_is_vmapped(). Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16xfs: remove old vmap cacheAlex Elder
Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: d2859751cd0bf586941ffa7308635a293f943c17 Author: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:40:44 +1100 XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunamp on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much). Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in subsequent patch. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> The only change I made was to use the "new" xfs_buf_is_vmapped() function in a place it had been open-coded in the original. Modified-by: Alex Elder <aelder@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-06Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (21 commits) xfs: return inode fork offset in bulkstat for fsr xfs: Increase the default size of the reserved blocks pool xfs: truncate delalloc extents when IO fails in writeback xfs: check for more work before sleeping in xfssyncd xfs: Fix a build warning in xfs_aops.c xfs: fix locking for inode cache radix tree tag updates xfs: remove xfs_ipin/xfs_iunpin xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait xfs: kill xfs_lrw.h xfs: factor common xfs_trans_bjoin code xfs: stop passing opaque handles to xfs_log.c routines xfs: split xfs_bmap_btalloc xfs: fix xfs_fsblock_t tracing xfs: fix inode pincount check in fsync xfs: Non-blocking inode locking in IO completion xfs: implement optimized fdatasync xfs: remove wrapper for the fsync file operation xfs: remove wrappers for read/write file operations xfs: merge xfs_lrw.c into xfs_file.c xfs: fix dquota trace format ...
2010-03-06Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits) nfsd4: fix minor memory leak svcrpc: treat uid's as unsigned nfsd: ensure sockets are closed on error Revert "sunrpc: move the close processing after do recvfrom method" Revert "sunrpc: fix peername failed on closed listener" sunrpc: remove unnecessary svc_xprt_put NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN xfs_export_operations.commit_metadata commit_metadata export operation replacing nfsd_sync_dir lockd: don't clear sm_monitored on nsm_reboot_lookup lockd: release reference to nsm_handle in nlm_host_rebooted nfsd: Use vfs_fsync_range() in nfsd_commit NFSD: Create PF_INET6 listener in write_ports SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found" SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt() NFSD: Support AF_INET6 in svc_addsock() function SUNRPC: Use rpc_pton() in ip_map_parse() nfsd: 4.1 has an rfc number nfsd41: Create the recovery entry for the NFSv4.1 client nfsd: use vfs_fsync for non-directories ...
2010-03-05Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits) quota: stop using QUOTA_OK / NO_QUOTA dquot: cleanup dquot initialize routine dquot: move dquot initialization responsibility into the filesystem dquot: cleanup dquot drop routine dquot: move dquot drop responsibility into the filesystem dquot: cleanup dquot transfer routine dquot: move dquot transfer responsibility into the filesystem dquot: cleanup inode allocation / freeing routines dquot: cleanup space allocation / freeing routines ext3: add writepage sanity checks ext3: Truncate allocated blocks if direct IO write fails to update i_size quota: Properly invalidate caches even for filesystems with blocksize < pagesize quota: generalize quota transfer interface quota: sb_quota state flags cleanup jbd: Delay discarding buffers in journal_unmap_buffer ext3: quota_write cross block boundary behaviour quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota quota: split out compat_sys_quotactl support from quota.c quota: split out netlink notification support from quota.c quota: remove invalid optimization from quota_sync_all ... Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
2010-03-05pass writeback_control to ->write_inodeChristoph Hellwig
This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by beeing able to distinguish between the different callers in more detail. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05make sure data is on disk before calling ->write_inodeChristoph Hellwig
Similar to the fsync issue fixed a while ago in commit 2daea67e966dc0c42067ebea015ddac6834cef88 we need to write for data to actually hit the disk before writing out the metadata to guarantee data integrity for filesystems that modify the inode in the data I/O completion path. Currently XFS and NFS handle this manually, and AFS has a write_inode method that does nothing but waiting for data, while others are possibly missing out on this. Fortunately this change has a lot less impact than the fsync change as none of the write_inode methods starts data writeout of any form by itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05Merge branch 'for-2.6.34-rc1-batch2' into for-linusAlex Elder
2010-03-05xfs: return inode fork offset in bulkstat for fsrDave Chinner
So that fsr can attempt to get the fork offset of the temporary inode it uses the same as the inode it is defragmenting, pass the fork offset out in the bulkstat information. The bulkstat structure has padding that has always been zeroed, so userspace can tell if this field is set or not by use of the xattr present flag and a non-zero value for the fork offset. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-05xfs: Increase the default size of the reserved blocks poolDave Chinner
The current default size of the reserved blocks pool is easy to deplete with certain workloads, in particular workloads that do lots of concurrent delayed allocation extent conversions. If enough transactions are running in parallel and the entire pool is consumed then subsequent calls to xfs_trans_reserve() will fail with ENOSPC. Also add a rate limited warning so we know if this starts happening again. This is an updated version of an old patch from Lachlan McIlroy. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-05xfs: truncate delalloc extents when IO fails in writebackDave Chinner
We currently use block_invalidatepage() to clean up pages where I/O fails in ->writepage(). Unfortunately, if the page has delalloc regions on it, we fail to remove the delalloc regions when we invalidate the page. This can result in tripping a BUG() in xfs_get_blocks() later on if a direct IO read is done on that same region - the delalloc extent is returned when none is supposed to be there. Fix this by truncating away the delalloc regions on the page before invalidating it. Because they are delalloc, we can do this without needing a transaction. Indeed - if we get ENOSPC errors, we have to be able to do this truncation without a transaction as there is no space left for block reservation (typically why we see a ENOSPC in writeback). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-05xfs: check for more work before sleeping in xfssyncdDave Chinner
xfssyncd processes a queue of work by detaching the queue and then iterating over all the work items. It then sleeps for a time period or until new work comes in. If new work is queued while xfssyncd is actively processing the detached work queue, it will not process that new work until after a sleep timeout or the next work event queued wakes it. Fix this by checking the work queue again before going to sleep. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-05xfs: Fix a build warning in xfs_aops.cDave Chinner
Fix a build warning that slipped through. Dave Chinner had posted an updated version of his patch but the previous version--without this fix--was what got committed. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-05quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquotaChristoph Hellwig
We already do these checks in the generic code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05quota: clean up Q_XQUOTASYNCChristoph Hellwig
Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something entirely different in it than the rest of the filesystems. xfs_quota which calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed allocations, while the "VFS" quota support wants to flush changes to the quota file. So make Q_XQUOTASYNC call into the writeback code directly and make the quota_sync method optional as XFS doesn't need in the sense expected by the rest of the quota code. GFS2 was using limited XFS-style quota and has a quota_sync method fitting neither the style used by vfs_quota_sync nor xfs_fs_quota_sync. I left it in for now as per discussion with Steve it expects to be called from the sync path this way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-04Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs into for-2.6.34-incomingJ. Bruce Fields
Resolve merge conflict in fs/xfs/linux-2.6/xfs_export.c.
2010-03-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.
2010-03-01xfs: fix locking for inode cache radix tree tag updatesChristoph Hellwig
The radix-tree code requires it's users to serialize tag updates against other updates to the tree. While XFS protects tag updates against each other it does not serialize them against updates of the tree contents, which can lead to tag corruption. Fix the inode cache to always take pag_ici_lock in exclusive mode when updating radix tree tags. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Patrick Schreurs <patrick@news-service.com> Tested-by: Patrick Schreurs <patrick@news-service.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: remove xfs_ipin/xfs_iunpinChristoph Hellwig
Inodes are only pinned/unpinned via the inode item methods, and lots of code relies on that fact. So remove the separate xfs_ipin/xfs_iunpin helpers and merge them into their only callers. This also fixes up various duplicate and/or incorrect comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowaitChristoph Hellwig
Remove the inode item pointer and ili_last_lsn checks in __xfs_iunpin_wait as any pinned inode is guaranteed to have them valid. After this the xfs_iunpin_nowait case is nothing more than a xfs_log_force_lsn, as we know that the caller has already checked the pincount. Make xfs_iunpin_nowait the new low-level routine just doing the log force and rewrite xfs_iunpin_wait around it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: kill xfs_lrw.hChristoph Hellwig
Move the two declarations to better fitting headers now that xfs_lrw.c is gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: factor common xfs_trans_bjoin codeChristoph Hellwig
Most of xfs_trans_bjoin is duplicated in xfs_trans_get_buf, xfs_trans_getsb and xfs_trans_read_buf. Add a new _xfs_trans_bjoin which can be called by all four functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: stop passing opaque handles to xfs_log.c routinesChristoph Hellwig
Currenly we pass opaque xfs_log_ticket_t handles instead of struct xlog_ticket pointers, and void pointers instead of struct xlog_in_core pointers to various log manager functions. Instead pass properly typed pointers after adding forward declarations for them to xfs_log.h, and adjust the touched function prototypes to the standard XFS style while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: split xfs_bmap_btallocChristoph Hellwig
Split out the nullfb case into a separate function to reduce the stack footprint and make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: fix xfs_fsblock_t tracingChristoph Hellwig
Using a static buffer in xfs_fmtfsblock means we can corrupt traces if multiple CPUs hit this code path at the same. Just remove xfs_fmtfsblock for now and print the block number purely numerical. If we want the NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be a decoding plugin in the trace-cmd userspace command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: fix inode pincount check in fsyncChristoph Hellwig
We need to hold the ilock to check the inode pincount safely. While we're at it also remove the check for ip->i_itemp->ili_last_lsn, a pinned inode always has it set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: Non-blocking inode locking in IO completionDave Chinner
The introduction of barriers to loop devices has created a new IO order completion dependency that XFS does not handle. The loop device implements barriers using fsync and so turns a log IO in the XFS filesystem on the loop device into a data IO in the backing filesystem. That is, the completion of log IOs in the loop filesystem are now dependent on completion of data IO in the backing filesystem. This can cause deadlocks when a flush daemon issues a log force with an inode locked because the IO completion of IO on the inode is blocked by the inode lock. This in turn prevents further data IO completion from occuring on all XFS filesystems on that CPU (due to the shared nature of the completion queues). This then prevents the log IO from completing because the log is waiting for data IO completion as well. The fix for this new completion order dependency issue is to make the IO completion inode locking non-blocking. If the inode lock can't be grabbed, simply requeue the IO completion back to the work queue so that it can be processed later. This prevents the completion queue from being blocked and allows data IO completion on other inodes to proceed, hence avoiding completion order dependent deadlocks. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: implement optimized fdatasyncChristoph Hellwig
Allow us to track the difference between timestamp and size updates by using mark_inode_dirty from the I/O completion code, and checking the VFS inode flags in xfs_file_fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: remove wrapper for the fsync file operationChristoph Hellwig
Currently the fsync file operation is divided into a low-level routine doing all the work and one that implements the Linux file operation and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a bit, as well as preparing for the implementation of an optimized fdatasync which needs to look at the Linux inode state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: remove wrappers for read/write file operationsChristoph Hellwig
Currently the aio_read, aio_write, splice_read and splice_write file operations are divided into a low-level routine doing all the work and one that implements the Linux file operations and does minimal argument wrapping. This is a leftover from the days of the vnode operations layer and can be removed to simplify the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: merge xfs_lrw.c into xfs_file.cChristoph Hellwig
Currently the code to implement the file operations is split over two small files. Merge the content of xfs_lrw.c into xfs_file.c to have it in one place. Note that I haven't done various cleanups that are possible after this yet, they will follow in the next patch. Also the function xfs_dev_is_read_only which was in xfs_lrw.c before really doesn't fit in here at all and was moved to xfs_mount.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: fix dquota trace formatChristoph Hellwig
The be32_to_cpu in the TP_printk output breaks automatic parsing of the trace format by the trace-cmd tools, so we have to move it into the TP_assign block. While we're at it also fix the format for the quota limits to more regular and easier parseable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01xfs: increase readdir buffer sizeEric Sandeen
While doing some testing of readdir perf a while back, I noticed that the buffer size we're using internally is smaller than what glibc gives us by default. Upping this size helped a bit, and seems safe. glibc's __alloc_dir() does: const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : 4 * BUFSIZ); const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64) ? sizeof (struct dirent64) : BUFSIZ); size_t allocation = default_allocation; #ifdef _STATBUF_ST_BLKSIZE if (statp != NULL && default_allocation < statp->st_blksize) allocation = statp->st_blksize; #endif and #define _G_BUFSIZ 8192 #define _IO_BUFSIZ _G_BUFSIZ # define BUFSIZ _IO_BUFSIZ so the default buffer is 4 * 8192 = 32768 (except in the unlikely case of blocks > 32k....) Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>