From 7d2ffe80aa000a149246b3745968634192eb5358 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 27 May 2013 16:38:23 +1000 Subject: xfs: fix split buffer vector log recovery support A long time ago in a galaxy far away.... .. the was a commit made to fix some ilinux specific "fragmented buffer" log recovery problem: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603 That problem occurred when a contiguous dirty region of a buffer was split across across two pages of an unmapped buffer. It's been a long time since that has been done in XFS, and the changes to log the entire inode buffers for CRC enabled filesystems has re-introduced that corner case. And, of course, it turns out that the above commit didn't actually fix anything - it just ensured that log recovery is guaranteed to fail when this situation occurs. And now for the gory details. xfstest xfs/085 is failing with this assert: XFS (vdb): bad number of regions (0) in inode log format XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583 Largely undocumented factoid #1: Log recovery depends on all log buffer format items starting with this format: struct foo_log_format { __uint16_t type; __uint16_t size; .... As recoery uses the size field and assumptions about 32 bit alignment in decoding format items. So don't pay much attention to the fact log recovery thinks that it decoding an inode log format item - it just uses them to determine what the size of the item is. But why would it see a log format item with a zero size? Well, luckily enough xfs_logprint uses the same code and gives the same error, so with a bit of gdb magic, it turns out that it isn't a log format that is being decoded. What logprint tells us is this: Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000 Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none BUF DATA ---------------------------------------------------------------------------- Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none xfs_logprint: unknown log operation type (4e49) ********************************************************************** * ERROR: data block=2 * ********************************************************************** That we've got a buffer format item (oper 130) that has two regions; the format item itself and one dirty region. The subsequent region after the buffer format item and it's data is them what we are tripping over, and the first bytes of it at an inode magic number. Not a log opheader like there is supposed to be. That means there's a problem with the buffer format item. It's dirty data region is 4096 bytes, and it contains - you guessed it - initialised inodes. But inode buffers are 8k, not 4k, and we log them in their entirety. So something is wrong here. The buffer format item contains: (gdb) p /x *(struct xfs_buf_log_format *)in_f $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000, blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2, blf_data_map = {0xffffffff, 0xffffffff, .... }} Two regions, and a signle dirty contiguous region of 64 bits. 64 * 128 = 8k, so this should be followed by a single 8k region of data. And the blf_flags tell us that the type of buffer is a XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation buffer. So, it should be followed by 8k of inode data. But we know that the next region has a header of: (gdb) p /x *ohead $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69, oh_flags = 0x0, oh_res2 = 0x0} and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not long enough to hold all the logged data. There must be another region. There is - there's a following opheader for another 4k of data that contains the other half of the inode cluster data - the one we assert fail on because it's not a log format header. So why is the second part of the data not being accounted to the correct buffer log format structure? It took a little more work with gdb to work out that the buffer log format structure was both expecting it to be there but hadn't accounted for it. It was at that point I went to the kernel code, as clearly this wasn't a bug in xfs_logprint and the kernel was writing bad stuff to the log. First port of call was the buffer item formatting code, and the discontiguous memory/contiguous dirty region handling code immediately stood out. I've wondered for a long time why the code had this comment in it: vecp->i_addr = xfs_buf_offset(bp, buffer_offset); vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* * You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary * split here * nvecs++; */ vecp++; And it didn't account for the extra vector pointer. The case being handled here is that a contiguous dirty region lies across a boundary that cannot be memcpy()d across, and so has to be split into two separate operations for xlog_write() to perform. What this code assumes is that what is written to the log is two consecutive blocks of data that are accounted in the buf log format item as the same contiguous dirty region and so will get decoded as such by the log recovery code. The thing is, xlog_write() knows nothing about this, and so just does it's normal thing of adding an opheader for each vector. That means the 8k region gets written to the log as two separate regions of 4k each, but because nvecs has not been incremented, the buf log format item accounts for only one of them. Hence when we come to log recovery, we process the first 4k region and then expect to come across a new item that starts with a log format structure of some kind that tells us whenteh next data is going to be. Instead, we hit raw buffer data and things go bad real quick. So, the commit from 2002 that commented out nvecs++ is just plain wrong. It breaks log recovery completely, and it would seem the only reason this hasn't been since then is that we don't log large contigous regions of multi-page unmapped buffers very often. Never would be a closer estimate, at least until the CRC code came along.... So, lets fix that by restoring the nvecs accounting for the extra region when we hit this case..... .... and there's the problemin log recovery it is apparently working around: XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135 Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty regions being broken up into multiple regions by the log formatting code. That's an easy fix, though - if the number of contiguous dirty bits exceeds the length of the region being copied out of the log, only account for the number of dirty bits that region covers, and then loop again and copy more from the next region. It's a 2 line fix. Now xfstests xfs/085 passes, we have one less piece of mystery code, and one more important piece of knowledge about how to structure new log format items.. Signed-off-by: Dave Chinner Reviewed-by: Mark Tinguely Signed-off-by: Ben Myers (cherry picked from commit 709da6a61aaf12181a8eea8443919ae5fc1b731d) --- fs/xfs/xfs_log_recover.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'fs/xfs/xfs_log_recover.c') diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 93f03ec17ee..d9e4d3c3991 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2096,6 +2096,17 @@ xlog_recover_do_reg_buffer( ASSERT(BBTOB(bp->b_io_length) >= ((uint)bit << XFS_BLF_SHIFT) + (nbits << XFS_BLF_SHIFT)); + /* + * The dirty regions logged in the buffer, even though + * contiguous, may span multiple chunks. This is because the + * dirty region may span a physical page boundary in a buffer + * and hence be split into two separate vectors for writing into + * the log. Hence we need to trim nbits back to the length of + * the current region being copied out of the log. + */ + if (item->ri_buf[i].i_len < (nbits << XFS_BLF_SHIFT)) + nbits = item->ri_buf[i].i_len >> XFS_BLF_SHIFT; + /* * Do a sanity check if this is a dquot buffer. Just checking * the first dquot in the buffer should do. XXXThis is -- cgit v1.2.3 From bb9b8e86ad083ecb2567ae909c1d6cb0bbaa60fe Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Mon, 3 Jun 2013 15:28:46 +1000 Subject: xfs: rework dquot CRCs Calculating dquot CRCs when the backing buffer is written back just doesn't work reliably. There are several places which manipulate dquots directly in the buffers, and they don't calculate CRCs appropriately, nor do they always set the buffer up to calculate CRCs appropriately. Firstly, if we log a dquot buffer (e.g. during allocation) it gets logged without valid CRC, and so on recovery we end up with a dquot that is not valid. Secondly, if we recover/repair a dquot, we don't have a verifier attached to the buffer and hence CRCs are not calculated on the way down to disk. Thirdly, calculating the CRC after we've changed the contents means that if we re-read the dquot from the buffer, we cannot verify the contents of the dquot are valid, as the CRC is invalid. So, to avoid all the dquot CRC errors that are being detected by the read verifier, change to using the same model as for inodes. That is, dquot CRCs are calculated and written to the backing buffer at the time the dquot is flushed to the backing buffer. If we modify the dquot directly in the backing buffer, calculate the CRC immediately after the modification is complete. Hence the dquot in the on-disk buffer should always have a valid CRC. Signed-off-by: Dave Chinner Reviewed-by: Brian Foster Reviewed-by: Ben Myers Signed-off-by: Ben Myers (cherry picked from commit 6fcdc59de28817d1fbf1bd58cc01f4f3fac858fb) --- fs/xfs/xfs_log_recover.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'fs/xfs/xfs_log_recover.c') diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index d9e4d3c3991..d6204d1ac47 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2266,6 +2266,12 @@ xfs_qm_dqcheck( d->dd_diskdq.d_flags = type; d->dd_diskdq.d_id = cpu_to_be32(id); + if (xfs_sb_version_hascrc(&mp->m_sb)) { + uuid_copy(&d->dd_uuid, &mp->m_sb.sb_uuid); + xfs_update_cksum((char *)d, sizeof(struct xfs_dqblk), + XFS_DQUOT_CRC_OFF); + } + return errs; } @@ -2793,6 +2799,10 @@ xlog_recover_dquot_pass2( } memcpy(ddq, recddq, item->ri_buf[1].i_len); + if (xfs_sb_version_hascrc(&mp->m_sb)) { + xfs_update_cksum((char *)ddq, sizeof(struct xfs_dqblk), + XFS_DQUOT_CRC_OFF); + } ASSERT(dq_f->qlf_size == 2); ASSERT(bp->b_target->bt_mount == mp); -- cgit v1.2.3 From 75406170751b4de88a01f73dda56efa617ddd5d7 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 5 Jun 2013 12:09:07 +1000 Subject: xfs: fix log recovery transaction item reordering There are several constraints that inode allocation and unlink logging impose on log recovery. These all stem from the fact that inode alloc/unlink are logged in buffers, but all other inode changes are logged in inode items. Hence there are ordering constraints that recovery must follow to ensure the correct result occurs. As it turns out, this ordering has been working mostly by chance than good management. The existing code moves all buffers except cancelled buffers to the head of the list, and everything else to the tail of the list. The problem with this is that is interleaves inode items with the buffer cancellation items, and hence whether the inode item in an cancelled buffer gets replayed is essentially left to chance. Further, this ordering causes problems for log recovery when inode CRCs are enabled. It typically replays the inode unlink buffer long before it replays the inode core changes, and so the CRC recorded in an unlink buffer is going to be invalid and hence any attempt to validate the inode in the buffer is going to fail. Hence we really need to enforce the ordering that the inode alloc/unlink code has expected log recovery to have since inode chunk de-allocation was introduced back in 2003... Signed-off-by: Dave Chinner Reviewed-by: Mark Tinguely Signed-off-by: Ben Myers (cherry picked from commit a775ad778073d55744ed6709ccede36310638911) --- fs/xfs/xfs_log_recover.c | 65 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 58 insertions(+), 7 deletions(-) (limited to 'fs/xfs/xfs_log_recover.c') diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index d6204d1ac47..83088d96e6c 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1599,10 +1599,43 @@ xlog_recover_add_to_trans( } /* - * Sort the log items in the transaction. Cancelled buffers need - * to be put first so they are processed before any items that might - * modify the buffers. If they are cancelled, then the modifications - * don't need to be replayed. + * Sort the log items in the transaction. + * + * The ordering constraints are defined by the inode allocation and unlink + * behaviour. The rules are: + * + * 1. Every item is only logged once in a given transaction. Hence it + * represents the last logged state of the item. Hence ordering is + * dependent on the order in which operations need to be performed so + * required initial conditions are always met. + * + * 2. Cancelled buffers are recorded in pass 1 in a separate table and + * there's nothing to replay from them so we can simply cull them + * from the transaction. However, we can't do that until after we've + * replayed all the other items because they may be dependent on the + * cancelled buffer and replaying the cancelled buffer can remove it + * form the cancelled buffer table. Hence they have tobe done last. + * + * 3. Inode allocation buffers must be replayed before inode items that + * read the buffer and replay changes into it. + * + * 4. Inode unlink buffers must be replayed after inode items are replayed. + * This ensures that inodes are completely flushed to the inode buffer + * in a "free" state before we remove the unlinked inode list pointer. + * + * Hence the ordering needs to be inode allocation buffers first, inode items + * second, inode unlink buffers third and cancelled buffers last. + * + * But there's a problem with that - we can't tell an inode allocation buffer + * apart from a regular buffer, so we can't separate them. We can, however, + * tell an inode unlink buffer from the others, and so we can separate them out + * from all the other buffers and move them to last. + * + * Hence, 4 lists, in order from head to tail: + * - buffer_list for all buffers except cancelled/inode unlink buffers + * - item_list for all non-buffer items + * - inode_buffer_list for inode unlink buffers + * - cancel_list for the cancelled buffers */ STATIC int xlog_recover_reorder_trans( @@ -1612,6 +1645,10 @@ xlog_recover_reorder_trans( { xlog_recover_item_t *item, *n; LIST_HEAD(sort_list); + LIST_HEAD(cancel_list); + LIST_HEAD(buffer_list); + LIST_HEAD(inode_buffer_list); + LIST_HEAD(inode_list); list_splice_init(&trans->r_itemq, &sort_list); list_for_each_entry_safe(item, n, &sort_list, ri_list) { @@ -1619,12 +1656,18 @@ xlog_recover_reorder_trans( switch (ITEM_TYPE(item)) { case XFS_LI_BUF: - if (!(buf_f->blf_flags & XFS_BLF_CANCEL)) { + if (buf_f->blf_flags & XFS_BLF_CANCEL) { trace_xfs_log_recover_item_reorder_head(log, trans, item, pass); - list_move(&item->ri_list, &trans->r_itemq); + list_move(&item->ri_list, &cancel_list); break; } + if (buf_f->blf_flags & XFS_BLF_INODE_BUF) { + list_move(&item->ri_list, &inode_buffer_list); + break; + } + list_move_tail(&item->ri_list, &buffer_list); + break; case XFS_LI_INODE: case XFS_LI_DQUOT: case XFS_LI_QUOTAOFF: @@ -1632,7 +1675,7 @@ xlog_recover_reorder_trans( case XFS_LI_EFI: trace_xfs_log_recover_item_reorder_tail(log, trans, item, pass); - list_move_tail(&item->ri_list, &trans->r_itemq); + list_move_tail(&item->ri_list, &inode_list); break; default: xfs_warn(log->l_mp, @@ -1643,6 +1686,14 @@ xlog_recover_reorder_trans( } } ASSERT(list_empty(&sort_list)); + if (!list_empty(&buffer_list)) + list_splice(&buffer_list, &trans->r_itemq); + if (!list_empty(&inode_list)) + list_splice_tail(&inode_list, &trans->r_itemq); + if (!list_empty(&inode_buffer_list)) + list_splice_tail(&inode_buffer_list, &trans->r_itemq); + if (!list_empty(&cancel_list)) + list_splice_tail(&cancel_list, &trans->r_itemq); return 0; } -- cgit v1.2.3 From ad868afddb908a5d4015c6b7637721b48fb9c8f9 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 5 Jun 2013 12:09:08 +1000 Subject: xfs: inode unlinked list needs to recalculate the inode CRC The inode unlinked list manipulations operate directly on the inode buffer, and so bypass the inode CRC calculation mechanisms. Hence an inode on the unlinked list has an invalid CRC. Fix this by recalculating the CRC whenever we modify an unlinked list pointer in an inode, ncluding during log recovery. This is trivial to do and results in unlinked list operations always leaving a consistent inode in the buffer. Signed-off-by: Dave Chinner Reviewed-by: Mark Tinguely Signed-off-by: Ben Myers (cherry picked from commit 0a32c26e720a8b38971d0685976f4a7d63f9e2ef) --- fs/xfs/xfs_log_recover.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'fs/xfs/xfs_log_recover.c') diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 83088d96e6c..45a85ff84da 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1912,6 +1912,15 @@ xlog_recover_do_inode_buffer( buffer_nextp = (xfs_agino_t *)xfs_buf_offset(bp, next_unlinked_offset); *buffer_nextp = *logged_nextp; + + /* + * If necessary, recalculate the CRC in the on-disk inode. We + * have to leave the inode in a consistent state for whoever + * reads it next.... + */ + xfs_dinode_calc_crc(mp, (struct xfs_dinode *) + xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize)); + } return 0; -- cgit v1.2.3 From d302cf1d316dca5f567e89872cf5d475c9a55f74 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Wed, 12 Jun 2013 12:19:06 +1000 Subject: xfs: don't shutdown log recovery on validation errors Unfortunately, we cannot guarantee that items logged multiple times and replayed by log recovery do not take objects back in time. When they are taken back in time, the go into an intermediate state which is corrupt, and hence verification that occurs on this intermediate state causes log recovery to abort with a corruption shutdown. Instead of causing a shutdown and unmountable filesystem, don't verify post-recovery items before they are written to disk. This is less than optimal, but there is no way to detect this issue for non-CRC filesystems If log recovery successfully completes, this will be undone and the object will be consistent by subsequent transactions that are replayed, so in most cases we don't need to take drastic action. For CRC enabled filesystems, leave the verifiers in place - we need to call them to recalculate the CRCs on the objects anyway. This recovery problem can be solved for such filesystems - we have a LSN stamped in all metadata at writeback time that we can to determine whether the item should be replayed or not. This is a separate piece of work, so is not addressed by this patch. Signed-off-by: Dave Chinner Reviewed-by: Ben Myers Signed-off-by: Ben Myers (cherry picked from commit 9222a9cf86c0d64ffbedf567412b55da18763aa3) --- fs/xfs/xfs_log_recover.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) (limited to 'fs/xfs/xfs_log_recover.c') diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 45a85ff84da..7cf5e4eafe2 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1845,7 +1845,13 @@ xlog_recover_do_inode_buffer( xfs_agino_t *buffer_nextp; trace_xfs_log_recover_buf_inode_buf(mp->m_log, buf_f); - bp->b_ops = &xfs_inode_buf_ops; + + /* + * Post recovery validation only works properly on CRC enabled + * filesystems. + */ + if (xfs_sb_version_hascrc(&mp->m_sb)) + bp->b_ops = &xfs_inode_buf_ops; inodes_per_buf = BBTOB(bp->b_io_length) >> mp->m_sb.sb_inodelog; for (i = 0; i < inodes_per_buf; i++) { @@ -2205,7 +2211,16 @@ xlog_recover_do_reg_buffer( /* Shouldn't be any more regions */ ASSERT(i == item->ri_total); - xlog_recovery_validate_buf_type(mp, bp, buf_f); + /* + * We can only do post recovery validation on items on CRC enabled + * fielsystems as we need to know when the buffer was written to be able + * to determine if we should have replayed the item. If we replay old + * metadata over a newer buffer, then it will enter a temporarily + * inconsistent state resulting in verification failures. Hence for now + * just avoid the verification stage for non-crc filesystems + */ + if (xfs_sb_version_hascrc(&mp->m_sb)) + xlog_recovery_validate_buf_type(mp, bp, buf_f); } /* -- cgit v1.2.3