From 5953316dbf90067ebdeca626c34488bc166b73a8 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 23 May 2013 12:25:08 +0200 Subject: block: make rq->cmd_flags be 64-bit We have officially run out of flags in a 32-bit space. Extend it to 64-bit even on 32-bit archs. Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/blk-core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 0a00e4ecf87c..213e9f01c627 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -174,9 +174,9 @@ void blk_dump_rq_flags(struct request *rq, char *msg) { int bit; - printk(KERN_INFO "%s: dev %s: type=%x, flags=%x\n", msg, + printk(KERN_INFO "%s: dev %s: type=%x, flags=%llx\n", msg, rq->rq_disk ? rq->rq_disk->disk_name : "?", rq->cmd_type, - rq->cmd_flags); + (unsigned long long) rq->cmd_flags); printk(KERN_INFO " sector %llu, nr/cnr %u/%u\n", (unsigned long long)blk_rq_pos(rq), -- cgit v1.2.3 From 71fe07d040626de7b72244bf6de889c2e0f5aea3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 4 Oct 2013 06:49:11 -0700 Subject: block: remove request ref_count This reference count has been around since before git history, but the only place where it's used is in blk_execute_rq, and ther it is entirely useless as it is incremented before submitting the request and decremented in the end_io handler before waking up the submitter thread. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/blk-core.c | 3 --- block/blk-exec.c | 7 ------- 2 files changed, 10 deletions(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 213e9f01c627..18faa7e81d3b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -145,7 +145,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->cmd = rq->__cmd; rq->cmd_len = BLK_MAX_CDB; rq->tag = -1; - rq->ref_count = 1; rq->start_time = jiffies; set_start_time_ns(rq); rq->part = NULL; @@ -1272,8 +1271,6 @@ void __blk_put_request(struct request_queue *q, struct request *req) { if (unlikely(!q)) return; - if (unlikely(--req->ref_count)) - return; blk_pm_put_request(req); diff --git a/block/blk-exec.c b/block/blk-exec.c index ae4f27d7944e..6b18d82d91c5 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -24,7 +24,6 @@ static void blk_end_sync_rq(struct request *rq, int error) struct completion *waiting = rq->end_io_data; rq->end_io_data = NULL; - __blk_put_request(rq->q, rq); /* * complete last, if this is a stack request the process (and thus @@ -103,12 +102,6 @@ int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, int err = 0; unsigned long hang_check; - /* - * we need an extra reference to the request, so we can look at - * it after io completion - */ - rq->ref_count++; - if (!rq->sense) { memset(sense, 0, sizeof(sense)); rq->sense = sense; -- cgit v1.2.3 From 320ae51feed5c2f13664aa05a76bec198967e04d Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 24 Oct 2013 09:20:05 +0100 Subject: blk-mq: new multi-queue block IO queueing mechanism Linux currently has two models for block devices: - The classic request_fn based approach, where drivers use struct request units for IO. The block layer provides various helper functionalities to let drivers share code, things like tag management, timeout handling, queueing, etc. - The "stacked" approach, where a driver squeezes in between the block layer and IO submitter. Since this bypasses the IO stack, driver generally have to manage everything themselves. With drivers being written for new high IOPS devices, the classic request_fn based driver doesn't work well enough. The design dates back to when both SMP and high IOPS was rare. It has problems with scaling to bigger machines, and runs into scaling issues even on smaller machines when you have IOPS in the hundreds of thousands per device. The stacked approach is then most often selected as the model for the driver. But this means that everybody has to re-invent everything, and along with that we get all the problems again that the shared approach solved. This commit introduces blk-mq, block multi queue support. The design is centered around per-cpu queues for queueing IO, which then funnel down into x number of hardware submission queues. We might have a 1:1 mapping between the two, or it might be an N:M mapping. That all depends on what the hardware supports. blk-mq provides various helper functions, which include: - Scalable support for request tagging. Most devices need to be able to uniquely identify a request both in the driver and to the hardware. The tagging uses per-cpu caches for freed tags, to enable cache hot reuse. - Timeout handling without tracking request on a per-device basis. Basically the driver should be able to get a notification, if a request happens to fail. - Optional support for non 1:1 mappings between issue and submission queues. blk-mq can redirect IO completions to the desired location. - Support for per-request payloads. Drivers almost always need to associate a request structure with some driver private command structure. Drivers can tell blk-mq this at init time, and then any request handed to the driver will have the required size of memory associated with it. - Support for merging of IO, and plugging. The stacked model gets neither of these. Even for high IOPS devices, merging sequential IO reduces per-command overhead and thus increases bandwidth. For now, this is provided as a potential 3rd queueing model, with the hope being that, as it matures, it can replace both the classic and stacked model. That would get us back to having just 1 real model for block devices, leaving the stacked approach to dm/md devices (as it was originally intended). Contributions in this patch from the following people: Shaohua Li Alexander Gordeev Christoph Hellwig Mike Christie Matias Bjorling Jeff Moyer Acked-by: Christoph Hellwig Signed-off-by: Jens Axboe --- block/Makefile | 5 +- block/blk-core.c | 142 +++-- block/blk-exec.c | 7 + block/blk-flush.c | 154 ++++- block/blk-mq-cpu.c | 93 ++++ block/blk-mq-cpumap.c | 108 ++++ block/blk-mq-sysfs.c | 384 +++++++++++++ block/blk-mq-tag.c | 204 +++++++ block/blk-mq-tag.h | 27 + block/blk-mq.c | 1480 +++++++++++++++++++++++++++++++++++++++++++++++++ block/blk-mq.h | 52 ++ block/blk-sysfs.c | 13 + block/blk-timeout.c | 73 ++- block/blk.h | 17 + 14 files changed, 2657 insertions(+), 102 deletions(-) create mode 100644 block/blk-mq-cpu.c create mode 100644 block/blk-mq-cpumap.c create mode 100644 block/blk-mq-sysfs.c create mode 100644 block/blk-mq-tag.c create mode 100644 block/blk-mq-tag.h create mode 100644 block/blk-mq.c create mode 100644 block/blk-mq.h (limited to 'block') diff --git a/block/Makefile b/block/Makefile index 671a83d063a5..20645e88fb57 100644 --- a/block/Makefile +++ b/block/Makefile @@ -5,8 +5,9 @@ obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ - blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o \ - partition-generic.o partitions/ + blk-iopoll.o blk-lib.o blk-mq.o blk-mq-tag.o \ + blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ + genhd.o scsi_ioctl.o partition-generic.o partitions/ obj-$(CONFIG_BLK_DEV_BSG) += bsg.o obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o diff --git a/block/blk-core.c b/block/blk-core.c index 18faa7e81d3b..3bb9e9f7f87e 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -48,7 +49,7 @@ DEFINE_IDA(blk_queue_ida); /* * For the allocated request tables */ -static struct kmem_cache *request_cachep; +struct kmem_cache *request_cachep = NULL; /* * For queue allocation @@ -60,42 +61,6 @@ struct kmem_cache *blk_requestq_cachep; */ static struct workqueue_struct *kblockd_workqueue; -static void drive_stat_acct(struct request *rq, int new_io) -{ - struct hd_struct *part; - int rw = rq_data_dir(rq); - int cpu; - - if (!blk_do_io_stat(rq)) - return; - - cpu = part_stat_lock(); - - if (!new_io) { - part = rq->part; - part_stat_inc(cpu, part, merges[rw]); - } else { - part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); - if (!hd_struct_try_get(part)) { - /* - * The partition is already being removed, - * the request will be accounted on the disk only - * - * We take a reference on disk->part0 although that - * partition will never be deleted, so we can treat - * it as any other partition. - */ - part = &rq->rq_disk->part0; - hd_struct_get(part); - } - part_round_stats(cpu, part); - part_inc_in_flight(part, rw); - rq->part = part; - } - - part_stat_unlock(); -} - void blk_queue_congestion_threshold(struct request_queue *q) { int nr; @@ -594,9 +559,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) if (!q) return NULL; + if (percpu_counter_init(&q->mq_usage_counter, 0)) + goto fail_q; + q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask); if (q->id < 0) - goto fail_q; + goto fail_c; q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; @@ -643,6 +611,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) q->bypass_depth = 1; __set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags); + init_waitqueue_head(&q->mq_freeze_wq); + if (blkcg_init_queue(q)) goto fail_id; @@ -650,6 +620,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) fail_id: ida_simple_remove(&blk_queue_ida, q->id); +fail_c: + percpu_counter_destroy(&q->mq_usage_counter); fail_q: kmem_cache_free(blk_requestq_cachep, q); return NULL; @@ -1108,7 +1080,8 @@ retry: goto retry; } -struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask) +static struct request *blk_old_get_request(struct request_queue *q, int rw, + gfp_t gfp_mask) { struct request *rq; @@ -1125,6 +1098,14 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask) return rq; } + +struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask) +{ + if (q->mq_ops) + return blk_mq_alloc_request(q, rw, gfp_mask); + else + return blk_old_get_request(q, rw, gfp_mask); +} EXPORT_SYMBOL(blk_get_request); /** @@ -1210,7 +1191,7 @@ EXPORT_SYMBOL(blk_requeue_request); static void add_acct_request(struct request_queue *q, struct request *rq, int where) { - drive_stat_acct(rq, 1); + blk_account_io_start(rq, true); __elv_add_request(q, rq, where); } @@ -1299,12 +1280,17 @@ EXPORT_SYMBOL_GPL(__blk_put_request); void blk_put_request(struct request *req) { - unsigned long flags; struct request_queue *q = req->q; - spin_lock_irqsave(q->queue_lock, flags); - __blk_put_request(q, req); - spin_unlock_irqrestore(q->queue_lock, flags); + if (q->mq_ops) + blk_mq_free_request(req); + else { + unsigned long flags; + + spin_lock_irqsave(q->queue_lock, flags); + __blk_put_request(q, req); + spin_unlock_irqrestore(q->queue_lock, flags); + } } EXPORT_SYMBOL(blk_put_request); @@ -1340,8 +1326,8 @@ void blk_add_request_payload(struct request *rq, struct page *page, } EXPORT_SYMBOL_GPL(blk_add_request_payload); -static bool bio_attempt_back_merge(struct request_queue *q, struct request *req, - struct bio *bio) +bool bio_attempt_back_merge(struct request_queue *q, struct request *req, + struct bio *bio) { const int ff = bio->bi_rw & REQ_FAILFAST_MASK; @@ -1358,12 +1344,12 @@ static bool bio_attempt_back_merge(struct request_queue *q, struct request *req, req->__data_len += bio->bi_size; req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); - drive_stat_acct(req, 0); + blk_account_io_start(req, false); return true; } -static bool bio_attempt_front_merge(struct request_queue *q, - struct request *req, struct bio *bio) +bool bio_attempt_front_merge(struct request_queue *q, struct request *req, + struct bio *bio) { const int ff = bio->bi_rw & REQ_FAILFAST_MASK; @@ -1388,12 +1374,12 @@ static bool bio_attempt_front_merge(struct request_queue *q, req->__data_len += bio->bi_size; req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); - drive_stat_acct(req, 0); + blk_account_io_start(req, false); return true; } /** - * attempt_plug_merge - try to merge with %current's plugged list + * blk_attempt_plug_merge - try to merge with %current's plugged list * @q: request_queue new bio is being queued at * @bio: new bio being queued * @request_count: out parameter for number of traversed plugged requests @@ -1409,8 +1395,8 @@ static bool bio_attempt_front_merge(struct request_queue *q, * reliable access to the elevator outside queue lock. Only check basic * merging parameters without querying the elevator. */ -static bool attempt_plug_merge(struct request_queue *q, struct bio *bio, - unsigned int *request_count) +bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, + unsigned int *request_count) { struct blk_plug *plug; struct request *rq; @@ -1489,7 +1475,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio) * Check if we can merge with the plugged list before grabbing * any locks. */ - if (attempt_plug_merge(q, bio, &request_count)) + if (blk_attempt_plug_merge(q, bio, &request_count)) return; spin_lock_irq(q->queue_lock); @@ -1557,7 +1543,7 @@ get_rq: } } list_add_tail(&req->queuelist, &plug->list); - drive_stat_acct(req, 1); + blk_account_io_start(req, true); } else { spin_lock_irq(q->queue_lock); add_acct_request(q, req, where); @@ -2011,7 +1997,7 @@ unsigned int blk_rq_err_bytes(const struct request *rq) } EXPORT_SYMBOL_GPL(blk_rq_err_bytes); -static void blk_account_io_completion(struct request *req, unsigned int bytes) +void blk_account_io_completion(struct request *req, unsigned int bytes) { if (blk_do_io_stat(req)) { const int rw = rq_data_dir(req); @@ -2025,7 +2011,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes) } } -static void blk_account_io_done(struct request *req) +void blk_account_io_done(struct request *req) { /* * Account IO completion. flush_rq isn't accounted as a @@ -2073,6 +2059,42 @@ static inline struct request *blk_pm_peek_request(struct request_queue *q, } #endif +void blk_account_io_start(struct request *rq, bool new_io) +{ + struct hd_struct *part; + int rw = rq_data_dir(rq); + int cpu; + + if (!blk_do_io_stat(rq)) + return; + + cpu = part_stat_lock(); + + if (!new_io) { + part = rq->part; + part_stat_inc(cpu, part, merges[rw]); + } else { + part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); + if (!hd_struct_try_get(part)) { + /* + * The partition is already being removed, + * the request will be accounted on the disk only + * + * We take a reference on disk->part0 although that + * partition will never be deleted, so we can treat + * it as any other partition. + */ + part = &rq->rq_disk->part0; + hd_struct_get(part); + } + part_round_stats(cpu, part); + part_inc_in_flight(part, rw); + rq->part = part; + } + + part_stat_unlock(); +} + /** * blk_peek_request - peek at the top of a request queue * @q: request queue to peek at @@ -2448,7 +2470,6 @@ static void blk_finish_request(struct request *req, int error) if (req->cmd_flags & REQ_DONTPREP) blk_unprep_request(req); - blk_account_io_done(req); if (req->end_io) @@ -2870,6 +2891,7 @@ void blk_start_plug(struct blk_plug *plug) plug->magic = PLUG_MAGIC; INIT_LIST_HEAD(&plug->list); + INIT_LIST_HEAD(&plug->mq_list); INIT_LIST_HEAD(&plug->cb_list); /* @@ -2967,6 +2989,10 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) BUG_ON(plug->magic != PLUG_MAGIC); flush_plug_callbacks(plug, from_schedule); + + if (!list_empty(&plug->mq_list)) + blk_mq_flush_plug_list(plug, from_schedule); + if (list_empty(&plug->list)) return; diff --git a/block/blk-exec.c b/block/blk-exec.c index 6b18d82d91c5..c3edf9dff566 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -5,6 +5,7 @@ #include #include #include +#include #include #include "blk.h" @@ -58,6 +59,12 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk, rq->rq_disk = bd_disk; rq->end_io = done; + + if (q->mq_ops) { + blk_mq_insert_request(q, rq, true); + return; + } + /* * need to check this before __blk_run_queue(), because rq can * be freed before that returns. diff --git a/block/blk-flush.c b/block/blk-flush.c index cc2b827a853c..3e4cc9c7890a 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -69,8 +69,10 @@ #include #include #include +#include #include "blk.h" +#include "blk-mq.h" /* FLUSH/FUA sequences */ enum { @@ -124,6 +126,24 @@ static void blk_flush_restore_request(struct request *rq) /* make @rq a normal request */ rq->cmd_flags &= ~REQ_FLUSH_SEQ; rq->end_io = rq->flush.saved_end_io; + + blk_clear_rq_complete(rq); +} + +static void mq_flush_data_run(struct work_struct *work) +{ + struct request *rq; + + rq = container_of(work, struct request, mq_flush_data); + + memset(&rq->csd, 0, sizeof(rq->csd)); + blk_mq_run_request(rq, true, false); +} + +static void blk_mq_flush_data_insert(struct request *rq) +{ + INIT_WORK(&rq->mq_flush_data, mq_flush_data_run); + kblockd_schedule_work(rq->q, &rq->mq_flush_data); } /** @@ -136,7 +156,7 @@ static void blk_flush_restore_request(struct request *rq) * completion and trigger the next step. * * CONTEXT: - * spin_lock_irq(q->queue_lock) + * spin_lock_irq(q->queue_lock or q->mq_flush_lock) * * RETURNS: * %true if requests were added to the dispatch queue, %false otherwise. @@ -146,7 +166,7 @@ static bool blk_flush_complete_seq(struct request *rq, unsigned int seq, { struct request_queue *q = rq->q; struct list_head *pending = &q->flush_queue[q->flush_pending_idx]; - bool queued = false; + bool queued = false, kicked; BUG_ON(rq->flush.seq & seq); rq->flush.seq |= seq; @@ -167,8 +187,12 @@ static bool blk_flush_complete_seq(struct request *rq, unsigned int seq, case REQ_FSEQ_DATA: list_move_tail(&rq->flush.list, &q->flush_data_in_flight); - list_add(&rq->queuelist, &q->queue_head); - queued = true; + if (q->mq_ops) + blk_mq_flush_data_insert(rq); + else { + list_add(&rq->queuelist, &q->queue_head); + queued = true; + } break; case REQ_FSEQ_DONE: @@ -181,28 +205,43 @@ static bool blk_flush_complete_seq(struct request *rq, unsigned int seq, BUG_ON(!list_empty(&rq->queuelist)); list_del_init(&rq->flush.list); blk_flush_restore_request(rq); - __blk_end_request_all(rq, error); + if (q->mq_ops) + blk_mq_end_io(rq, error); + else + __blk_end_request_all(rq, error); break; default: BUG(); } - return blk_kick_flush(q) | queued; + kicked = blk_kick_flush(q); + /* blk_mq_run_flush will run queue */ + if (q->mq_ops) + return queued; + return kicked | queued; } static void flush_end_io(struct request *flush_rq, int error) { struct request_queue *q = flush_rq->q; - struct list_head *running = &q->flush_queue[q->flush_running_idx]; + struct list_head *running; bool queued = false; struct request *rq, *n; + unsigned long flags = 0; + if (q->mq_ops) { + blk_mq_free_request(flush_rq); + spin_lock_irqsave(&q->mq_flush_lock, flags); + } + running = &q->flush_queue[q->flush_running_idx]; BUG_ON(q->flush_pending_idx == q->flush_running_idx); /* account completion of the flush request */ q->flush_running_idx ^= 1; - elv_completed_request(q, flush_rq); + + if (!q->mq_ops) + elv_completed_request(q, flush_rq); /* and push the waiting requests to the next stage */ list_for_each_entry_safe(rq, n, running, flush.list) { @@ -223,9 +262,48 @@ static void flush_end_io(struct request *flush_rq, int error) * directly into request_fn may confuse the driver. Always use * kblockd. */ - if (queued || q->flush_queue_delayed) - blk_run_queue_async(q); + if (queued || q->flush_queue_delayed) { + if (!q->mq_ops) + blk_run_queue_async(q); + else + /* + * This can be optimized to only run queues with requests + * queued if necessary. + */ + blk_mq_run_queues(q, true); + } q->flush_queue_delayed = 0; + if (q->mq_ops) + spin_unlock_irqrestore(&q->mq_flush_lock, flags); +} + +static void mq_flush_work(struct work_struct *work) +{ + struct request_queue *q; + struct request *rq; + + q = container_of(work, struct request_queue, mq_flush_work); + + /* We don't need set REQ_FLUSH_SEQ, it's for consistency */ + rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ, + __GFP_WAIT|GFP_ATOMIC); + rq->cmd_type = REQ_TYPE_FS; + rq->end_io = flush_end_io; + + blk_mq_run_request(rq, true, false); +} + +/* + * We can't directly use q->flush_rq, because it doesn't have tag and is not in + * hctx->rqs[]. so we must allocate a new request, since we can't sleep here, + * so offload the work to workqueue. + * + * Note: we assume a flush request finished in any hardware queue will flush + * the whole disk cache. + */ +static void mq_run_flush(struct request_queue *q) +{ + kblockd_schedule_work(q, &q->mq_flush_work); } /** @@ -236,7 +314,7 @@ static void flush_end_io(struct request *flush_rq, int error) * Please read the comment at the top of this file for more info. * * CONTEXT: - * spin_lock_irq(q->queue_lock) + * spin_lock_irq(q->queue_lock or q->mq_flush_lock) * * RETURNS: * %true if flush was issued, %false otherwise. @@ -261,13 +339,18 @@ static bool blk_kick_flush(struct request_queue *q) * Issue flush and toggle pending_idx. This makes pending_idx * different from running_idx, which means flush is in flight. */ + q->flush_pending_idx ^= 1; + if (q->mq_ops) { + mq_run_flush(q); + return true; + } + blk_rq_init(q, &q->flush_rq); q->flush_rq.cmd_type = REQ_TYPE_FS; q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ; q->flush_rq.rq_disk = first_rq->rq_disk; q->flush_rq.end_io = flush_end_io; - q->flush_pending_idx ^= 1; list_add_tail(&q->flush_rq.queuelist, &q->queue_head); return true; } @@ -284,16 +367,37 @@ static void flush_data_end_io(struct request *rq, int error) blk_run_queue_async(q); } +static void mq_flush_data_end_io(struct request *rq, int error) +{ + struct request_queue *q = rq->q; + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx; + unsigned long flags; + + ctx = rq->mq_ctx; + hctx = q->mq_ops->map_queue(q, ctx->cpu); + + /* + * After populating an empty queue, kick it to avoid stall. Read + * the comment in flush_end_io(). + */ + spin_lock_irqsave(&q->mq_flush_lock, flags); + if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error)) + blk_mq_run_hw_queue(hctx, true); + spin_unlock_irqrestore(&q->mq_flush_lock, flags); +} + /** * blk_insert_flush - insert a new FLUSH/FUA request * @rq: request to insert * * To be called from __elv_add_request() for %ELEVATOR_INSERT_FLUSH insertions. + * or __blk_mq_run_hw_queue() to dispatch request. * @rq is being submitted. Analyze what needs to be done and put it on the * right queue. * * CONTEXT: - * spin_lock_irq(q->queue_lock) + * spin_lock_irq(q->queue_lock) in !mq case */ void blk_insert_flush(struct request *rq) { @@ -316,7 +420,10 @@ void blk_insert_flush(struct request *rq) * complete the request. */ if (!policy) { - __blk_end_bidi_request(rq, 0, 0, 0); + if (q->mq_ops) + blk_mq_end_io(rq, 0); + else + __blk_end_bidi_request(rq, 0, 0, 0); return; } @@ -329,7 +436,10 @@ void blk_insert_flush(struct request *rq) */ if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { - list_add_tail(&rq->queuelist, &q->queue_head); + if (q->mq_ops) { + blk_mq_run_request(rq, false, true); + } else + list_add_tail(&rq->queuelist, &q->queue_head); return; } @@ -341,6 +451,14 @@ void blk_insert_flush(struct request *rq) INIT_LIST_HEAD(&rq->flush.list); rq->cmd_flags |= REQ_FLUSH_SEQ; rq->flush.saved_end_io = rq->end_io; /* Usually NULL */ + if (q->mq_ops) { + rq->end_io = mq_flush_data_end_io; + + spin_lock_irq(&q->mq_flush_lock); + blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0); + spin_unlock_irq(&q->mq_flush_lock); + return; + } rq->end_io = flush_data_end_io; blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0); @@ -453,3 +571,9 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask, return ret; } EXPORT_SYMBOL(blkdev_issue_flush); + +void blk_mq_init_flush(struct request_queue *q) +{ + spin_lock_init(&q->mq_flush_lock); + INIT_WORK(&q->mq_flush_work, mq_flush_work); +} diff --git a/block/blk-mq-cpu.c b/block/blk-mq-cpu.c new file mode 100644 index 000000000000..f8ea39d7ae54 --- /dev/null +++ b/block/blk-mq-cpu.c @@ -0,0 +1,93 @@ +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include "blk-mq.h" + +static LIST_HEAD(blk_mq_cpu_notify_list); +static DEFINE_SPINLOCK(blk_mq_cpu_notify_lock); + +static int __cpuinit blk_mq_main_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + unsigned int cpu = (unsigned long) hcpu; + struct blk_mq_cpu_notifier *notify; + + spin_lock(&blk_mq_cpu_notify_lock); + + list_for_each_entry(notify, &blk_mq_cpu_notify_list, list) + notify->notify(notify->data, action, cpu); + + spin_unlock(&blk_mq_cpu_notify_lock); + return NOTIFY_OK; +} + +static void __cpuinit blk_mq_cpu_notify(void *data, unsigned long action, + unsigned int cpu) +{ + if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { + /* + * If the CPU goes away, ensure that we run any pending + * completions. + */ + struct llist_node *node; + struct request *rq; + + local_irq_disable(); + + node = llist_del_all(&per_cpu(ipi_lists, cpu)); + while (node) { + struct llist_node *next = node->next; + + rq = llist_entry(node, struct request, ll_list); + __blk_mq_end_io(rq, rq->errors); + node = next; + } + + local_irq_enable(); + } +} + +static struct notifier_block __cpuinitdata blk_mq_main_cpu_notifier = { + .notifier_call = blk_mq_main_cpu_notify, +}; + +void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier) +{ + BUG_ON(!notifier->notify); + + spin_lock(&blk_mq_cpu_notify_lock); + list_add_tail(¬ifier->list, &blk_mq_cpu_notify_list); + spin_unlock(&blk_mq_cpu_notify_lock); +} + +void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier) +{ + spin_lock(&blk_mq_cpu_notify_lock); + list_del(¬ifier->list); + spin_unlock(&blk_mq_cpu_notify_lock); +} + +void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, + void (*fn)(void *, unsigned long, unsigned int), + void *data) +{ + notifier->notify = fn; + notifier->data = data; +} + +static struct blk_mq_cpu_notifier __cpuinitdata cpu_notifier = { + .notify = blk_mq_cpu_notify, +}; + +void __init blk_mq_cpu_init(void) +{ + register_hotcpu_notifier(&blk_mq_main_cpu_notifier); + blk_mq_register_cpu_notifier(&cpu_notifier); +} diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c new file mode 100644 index 000000000000..f8721278601c --- /dev/null +++ b/block/blk-mq-cpumap.c @@ -0,0 +1,108 @@ +#include +#include +#include +#include +#include +#include + +#include +#include "blk.h" +#include "blk-mq.h" + +static void show_map(unsigned int *map, unsigned int nr) +{ + int i; + + pr_info("blk-mq: CPU -> queue map\n"); + for_each_online_cpu(i) + pr_info(" CPU%2u -> Queue %u\n", i, map[i]); +} + +static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues, + const int cpu) +{ + return cpu / ((nr_cpus + nr_queues - 1) / nr_queues); +} + +static int get_first_sibling(unsigned int cpu) +{ + unsigned int ret; + + ret = cpumask_first(topology_thread_cpumask(cpu)); + if (ret < nr_cpu_ids) + return ret; + + return cpu; +} + +int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues) +{ + unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling; + cpumask_var_t cpus; + + if (!alloc_cpumask_var(&cpus, GFP_ATOMIC)) + return 1; + + cpumask_clear(cpus); + nr_cpus = nr_uniq_cpus = 0; + for_each_online_cpu(i) { + nr_cpus++; + first_sibling = get_first_sibling(i); + if (!cpumask_test_cpu(first_sibling, cpus)) + nr_uniq_cpus++; + cpumask_set_cpu(i, cpus); + } + + queue = 0; + for_each_possible_cpu(i) { + if (!cpu_online(i)) { + map[i] = 0; + continue; + } + + /* + * Easy case - we have equal or more hardware queues. Or + * there are no thread siblings to take into account. Do + * 1:1 if enough, or sequential mapping if less. + */ + if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) { + map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue); + queue++; + continue; + } + + /* + * Less then nr_cpus queues, and we have some number of + * threads per cores. Map sibling threads to the same + * queue. + */ + first_sibling = get_first_sibling(i); + if (first_sibling == i) { + map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues, + queue); + queue++; + } else + map[i] = map[first_sibling]; + } + + show_map(map, nr_cpus); + free_cpumask_var(cpus); + return 0; +} + +unsigned int *blk_mq_make_queue_map(struct blk_mq_reg *reg) +{ + unsigned int *map; + + /* If cpus are offline, map them to first hctx */ + map = kzalloc_node(sizeof(*map) * num_possible_cpus(), GFP_KERNEL, + reg->numa_node); + if (!map) + return NULL; + + if (!blk_mq_update_queue_map(map, reg->nr_hw_queues)) + return map; + + kfree(map); + return NULL; +} diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c new file mode 100644 index 000000000000..ba6cf8e9aa0a --- /dev/null +++ b/block/blk-mq-sysfs.c @@ -0,0 +1,384 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include "blk-mq.h" +#include "blk-mq-tag.h" + +static void blk_mq_sysfs_release(struct kobject *kobj) +{ +} + +struct blk_mq_ctx_sysfs_entry { + struct attribute attr; + ssize_t (*show)(struct blk_mq_ctx *, char *); + ssize_t (*store)(struct blk_mq_ctx *, const char *, size_t); +}; + +struct blk_mq_hw_ctx_sysfs_entry { + struct attribute attr; + ssize_t (*show)(struct blk_mq_hw_ctx *, char *); + ssize_t (*store)(struct blk_mq_hw_ctx *, const char *, size_t); +}; + +static ssize_t blk_mq_sysfs_show(struct kobject *kobj, struct attribute *attr, + char *page) +{ + struct blk_mq_ctx_sysfs_entry *entry; + struct blk_mq_ctx *ctx; + struct request_queue *q; + ssize_t res; + + entry = container_of(attr, struct blk_mq_ctx_sysfs_entry, attr); + ctx = container_of(kobj, struct blk_mq_ctx, kobj); + q = ctx->queue; + + if (!entry->show) + return -EIO; + + res = -ENOENT; + mutex_lock(&q->sysfs_lock); + if (!blk_queue_dying(q)) + res = entry->show(ctx, page); + mutex_unlock(&q->sysfs_lock); + return res; +} + +static ssize_t blk_mq_sysfs_store(struct kobject *kobj, struct attribute *attr, + const char *page, size_t length) +{ + struct blk_mq_ctx_sysfs_entry *entry; + struct blk_mq_ctx *ctx; + struct request_queue *q; + ssize_t res; + + entry = container_of(attr, struct blk_mq_ctx_sysfs_entry, attr); + ctx = container_of(kobj, struct blk_mq_ctx, kobj); + q = ctx->queue; + + if (!entry->store) + return -EIO; + + res = -ENOENT; + mutex_lock(&q->sysfs_lock); + if (!blk_queue_dying(q)) + res = entry->store(ctx, page, length); + mutex_unlock(&q->sysfs_lock); + return res; +} + +static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj, + struct attribute *attr, char *page) +{ + struct blk_mq_hw_ctx_sysfs_entry *entry; + struct blk_mq_hw_ctx *hctx; + struct request_queue *q; + ssize_t res; + + entry = container_of(attr, struct blk_mq_hw_ctx_sysfs_entry, attr); + hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + q = hctx->queue; + + if (!entry->show) + return -EIO; + + res = -ENOENT; + mutex_lock(&q->sysfs_lock); + if (!blk_queue_dying(q)) + res = entry->show(hctx, page); + mutex_unlock(&q->sysfs_lock); + return res; +} + +static ssize_t blk_mq_hw_sysfs_store(struct kobject *kobj, + struct attribute *attr, const char *page, + size_t length) +{ + struct blk_mq_hw_ctx_sysfs_entry *entry; + struct blk_mq_hw_ctx *hctx; + struct request_queue *q; + ssize_t res; + + entry = container_of(attr, struct blk_mq_hw_ctx_sysfs_entry, attr); + hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + q = hctx->queue; + + if (!entry->store) + return -EIO; + + res = -ENOENT; + mutex_lock(&q->sysfs_lock); + if (!blk_queue_dying(q)) + res = entry->store(hctx, page, length); + mutex_unlock(&q->sysfs_lock); + return res; +} + +static ssize_t blk_mq_sysfs_dispatched_show(struct blk_mq_ctx *ctx, char *page) +{ + return sprintf(page, "%lu %lu\n", ctx->rq_dispatched[1], + ctx->rq_dispatched[0]); +} + +static ssize_t blk_mq_sysfs_merged_show(struct blk_mq_ctx *ctx, char *page) +{ + return sprintf(page, "%lu\n", ctx->rq_merged); +} + +static ssize_t blk_mq_sysfs_completed_show(struct blk_mq_ctx *ctx, char *page) +{ + return sprintf(page, "%lu %lu\n", ctx->rq_completed[1], + ctx->rq_completed[0]); +} + +static ssize_t sysfs_list_show(char *page, struct list_head *list, char *msg) +{ + char *start_page = page; + struct request *rq; + + page += sprintf(page, "%s:\n", msg); + + list_for_each_entry(rq, list, queuelist) + page += sprintf(page, "\t%p\n", rq); + + return page - start_page; +} + +static ssize_t blk_mq_sysfs_rq_list_show(struct blk_mq_ctx *ctx, char *page) +{ + ssize_t ret; + + spin_lock(&ctx->lock); + ret = sysfs_list_show(page, &ctx->rq_list, "CTX pending"); + spin_unlock(&ctx->lock); + + return ret; +} + +static ssize_t blk_mq_hw_sysfs_queued_show(struct blk_mq_hw_ctx *hctx, + char *page) +{ + return sprintf(page, "%lu\n", hctx->queued); +} + +static ssize_t blk_mq_hw_sysfs_run_show(struct blk_mq_hw_ctx *hctx, char *page) +{ + return sprintf(page, "%lu\n", hctx->run); +} + +static ssize_t blk_mq_hw_sysfs_dispatched_show(struct blk_mq_hw_ctx *hctx, + char *page) +{ + char *start_page = page; + int i; + + page += sprintf(page, "%8u\t%lu\n", 0U, hctx->dispatched[0]); + + for (i = 1; i < BLK_MQ_MAX_DISPATCH_ORDER; i++) { + unsigned long d = 1U << (i - 1); + + page += sprintf(page, "%8lu\t%lu\n", d, hctx->dispatched[i]); + } + + return page - start_page; +} + +static ssize_t blk_mq_hw_sysfs_rq_list_show(struct blk_mq_hw_ctx *hctx, + char *page) +{ + ssize_t ret; + + spin_lock(&hctx->lock); + ret = sysfs_list_show(page, &hctx->dispatch, "HCTX pending"); + spin_unlock(&hctx->lock); + + return ret; +} + +static ssize_t blk_mq_hw_sysfs_ipi_show(struct blk_mq_hw_ctx *hctx, char *page) +{ + ssize_t ret; + + spin_lock(&hctx->lock); + ret = sprintf(page, "%u\n", !!(hctx->flags & BLK_MQ_F_SHOULD_IPI)); + spin_unlock(&hctx->lock); + + return ret; +} + +static ssize_t blk_mq_hw_sysfs_ipi_store(struct blk_mq_hw_ctx *hctx, + const char *page, size_t len) +{ + struct blk_mq_ctx *ctx; + unsigned long ret; + unsigned int i; + + if (kstrtoul(page, 10, &ret)) { + pr_err("blk-mq-sysfs: invalid input '%s'\n", page); + return -EINVAL; + } + + spin_lock(&hctx->lock); + if (ret) + hctx->flags |= BLK_MQ_F_SHOULD_IPI; + else + hctx->flags &= ~BLK_MQ_F_SHOULD_IPI; + spin_unlock(&hctx->lock); + + hctx_for_each_ctx(hctx, ctx, i) + ctx->ipi_redirect = !!ret; + + return len; +} + +static ssize_t blk_mq_hw_sysfs_tags_show(struct blk_mq_hw_ctx *hctx, char *page) +{ + return blk_mq_tag_sysfs_show(hctx->tags, page); +} + +static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = { + .attr = {.name = "dispatched", .mode = S_IRUGO }, + .show = blk_mq_sysfs_dispatched_show, +}; +static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_merged = { + .attr = {.name = "merged", .mode = S_IRUGO }, + .show = blk_mq_sysfs_merged_show, +}; +static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_completed = { + .attr = {.name = "completed", .mode = S_IRUGO }, + .show = blk_mq_sysfs_completed_show, +}; +static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_rq_list = { + .attr = {.name = "rq_list", .mode = S_IRUGO }, + .show = blk_mq_sysfs_rq_list_show, +}; + +static struct attribute *default_ctx_attrs[] = { + &blk_mq_sysfs_dispatched.attr, + &blk_mq_sysfs_merged.attr, + &blk_mq_sysfs_completed.attr, + &blk_mq_sysfs_rq_list.attr, + NULL, +}; + +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_queued = { + .attr = {.name = "queued", .mode = S_IRUGO }, + .show = blk_mq_hw_sysfs_queued_show, +}; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_run = { + .attr = {.name = "run", .mode = S_IRUGO }, + .show = blk_mq_hw_sysfs_run_show, +}; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_dispatched = { + .attr = {.name = "dispatched", .mode = S_IRUGO }, + .show = blk_mq_hw_sysfs_dispatched_show, +}; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_pending = { + .attr = {.name = "pending", .mode = S_IRUGO }, + .show = blk_mq_hw_sysfs_rq_list_show, +}; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_ipi = { + .attr = {.name = "ipi_redirect", .mode = S_IRUGO | S_IWUSR}, + .show = blk_mq_hw_sysfs_ipi_show, + .store = blk_mq_hw_sysfs_ipi_store, +}; +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tags = { + .attr = {.name = "tags", .mode = S_IRUGO }, + .show = blk_mq_hw_sysfs_tags_show, +}; + +static struct attribute *default_hw_ctx_attrs[] = { + &blk_mq_hw_sysfs_queued.attr, + &blk_mq_hw_sysfs_run.attr, + &blk_mq_hw_sysfs_dispatched.attr, + &blk_mq_hw_sysfs_pending.attr, + &blk_mq_hw_sysfs_ipi.attr, + &blk_mq_hw_sysfs_tags.attr, + NULL, +}; + +static const struct sysfs_ops blk_mq_sysfs_ops = { + .show = blk_mq_sysfs_show, + .store = blk_mq_sysfs_store, +}; + +static const struct sysfs_ops blk_mq_hw_sysfs_ops = { + .show = blk_mq_hw_sysfs_show, + .store = blk_mq_hw_sysfs_store, +}; + +static struct kobj_type blk_mq_ktype = { + .sysfs_ops = &blk_mq_sysfs_ops, + .release = blk_mq_sysfs_release, +}; + +static struct kobj_type blk_mq_ctx_ktype = { + .sysfs_ops = &blk_mq_sysfs_ops, + .default_attrs = default_ctx_attrs, + .release = blk_mq_sysfs_release, +}; + +static struct kobj_type blk_mq_hw_ktype = { + .sysfs_ops = &blk_mq_hw_sysfs_ops, + .default_attrs = default_hw_ctx_attrs, + .release = blk_mq_sysfs_release, +}; + +void blk_mq_unregister_disk(struct gendisk *disk) +{ + struct request_queue *q = disk->queue; + + kobject_uevent(&q->mq_kobj, KOBJ_REMOVE); + kobject_del(&q->mq_kobj); + + kobject_put(&disk_to_dev(disk)->kobj); +} + +int blk_mq_register_disk(struct gendisk *disk) +{ + struct device *dev = disk_to_dev(disk); + struct request_queue *q = disk->queue; + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx; + int ret, i, j; + + kobject_init(&q->mq_kobj, &blk_mq_ktype); + + ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); + if (ret < 0) + return ret; + + kobject_uevent(&q->mq_kobj, KOBJ_ADD); + + queue_for_each_hw_ctx(q, hctx, i) { + kobject_init(&hctx->kobj, &blk_mq_hw_ktype); + ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", i); + if (ret) + break; + + if (!hctx->nr_ctx) + continue; + + hctx_for_each_ctx(hctx, ctx, j) { + kobject_init(&ctx->kobj, &blk_mq_ctx_ktype); + ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu); + if (ret) + break; + } + } + + if (ret) { + blk_mq_unregister_disk(disk); + return ret; + } + + return 0; +} diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c new file mode 100644 index 000000000000..d64a02fb1f73 --- /dev/null +++ b/block/blk-mq-tag.c @@ -0,0 +1,204 @@ +#include +#include +#include + +#include +#include "blk.h" +#include "blk-mq.h" +#include "blk-mq-tag.h" + +/* + * Per tagged queue (tag address space) map + */ +struct blk_mq_tags { + unsigned int nr_tags; + unsigned int nr_reserved_tags; + unsigned int nr_batch_move; + unsigned int nr_max_cache; + + struct percpu_ida free_tags; + struct percpu_ida reserved_tags; +}; + +void blk_mq_wait_for_tags(struct blk_mq_tags *tags) +{ + int tag = blk_mq_get_tag(tags, __GFP_WAIT, false); + blk_mq_put_tag(tags, tag); +} + +bool blk_mq_has_free_tags(struct blk_mq_tags *tags) +{ + return !tags || + percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids) != 0; +} + +static unsigned int __blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp) +{ + int tag; + + tag = percpu_ida_alloc(&tags->free_tags, gfp); + if (tag < 0) + return BLK_MQ_TAG_FAIL; + return tag + tags->nr_reserved_tags; +} + +static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_tags *tags, + gfp_t gfp) +{ + int tag; + + if (unlikely(!tags->nr_reserved_tags)) { + WARN_ON_ONCE(1); + return BLK_MQ_TAG_FAIL; + } + + tag = percpu_ida_alloc(&tags->reserved_tags, gfp); + if (tag < 0) + return BLK_MQ_TAG_FAIL; + return tag; +} + +unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved) +{ + if (!reserved) + return __blk_mq_get_tag(tags, gfp); + + return __blk_mq_get_reserved_tag(tags, gfp); +} + +static void __blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag) +{ + BUG_ON(tag >= tags->nr_tags); + + percpu_ida_free(&tags->free_tags, tag - tags->nr_reserved_tags); +} + +static void __blk_mq_put_reserved_tag(struct blk_mq_tags *tags, + unsigned int tag) +{ + BUG_ON(tag >= tags->nr_reserved_tags); + + percpu_ida_free(&tags->reserved_tags, tag); +} + +void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag) +{ + if (tag >= tags->nr_reserved_tags) + __blk_mq_put_tag(tags, tag); + else + __blk_mq_put_reserved_tag(tags, tag); +} + +static int __blk_mq_tag_iter(unsigned id, void *data) +{ + unsigned long *tag_map = data; + __set_bit(id, tag_map); + return 0; +} + +void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, + void (*fn)(void *, unsigned long *), void *data) +{ + unsigned long *tag_map; + size_t map_size; + + map_size = ALIGN(tags->nr_tags, BITS_PER_LONG) / BITS_PER_LONG; + tag_map = kzalloc(map_size * sizeof(unsigned long), GFP_ATOMIC); + if (!tag_map) + return; + + percpu_ida_for_each_free(&tags->free_tags, __blk_mq_tag_iter, tag_map); + if (tags->nr_reserved_tags) + percpu_ida_for_each_free(&tags->reserved_tags, __blk_mq_tag_iter, + tag_map); + + fn(data, tag_map); + kfree(tag_map); +} + +struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags, + unsigned int reserved_tags, int node) +{ + unsigned int nr_tags, nr_cache; + struct blk_mq_tags *tags; + int ret; + + if (total_tags > BLK_MQ_TAG_MAX) { + pr_err("blk-mq: tag depth too large\n"); + return NULL; + } + + tags = kzalloc_node(sizeof(*tags), GFP_KERNEL, node); + if (!tags) + return NULL; + + nr_tags = total_tags - reserved_tags; + nr_cache = nr_tags / num_possible_cpus(); + + if (nr_cache < BLK_MQ_TAG_CACHE_MIN) + nr_cache = BLK_MQ_TAG_CACHE_MIN; + else if (nr_cache > BLK_MQ_TAG_CACHE_MAX) + nr_cache = BLK_MQ_TAG_CACHE_MAX; + + tags->nr_tags = total_tags; + tags->nr_reserved_tags = reserved_tags; + tags->nr_max_cache = nr_cache; + tags->nr_batch_move = max(1u, nr_cache / 2); + + ret = __percpu_ida_init(&tags->free_tags, tags->nr_tags - + tags->nr_reserved_tags, + tags->nr_max_cache, + tags->nr_batch_move); + if (ret) + goto err_free_tags; + + if (reserved_tags) { + /* + * With max_cahe and batch set to 1, the allocator fallbacks to + * no cached. It's fine reserved tags allocation is slow. + */ + ret = __percpu_ida_init(&tags->reserved_tags, reserved_tags, + 1, 1); + if (ret) + goto err_reserved_tags; + } + + return tags; + +err_reserved_tags: + percpu_ida_destroy(&tags->free_tags); +err_free_tags: + kfree(tags); + return NULL; +} + +void blk_mq_free_tags(struct blk_mq_tags *tags) +{ + percpu_ida_destroy(&tags->free_tags); + percpu_ida_destroy(&tags->reserved_tags); + kfree(tags); +} + +ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page) +{ + char *orig_page = page; + int cpu; + + if (!tags) + return 0; + + page += sprintf(page, "nr_tags=%u, reserved_tags=%u, batch_move=%u," + " max_cache=%u\n", tags->nr_tags, tags->nr_reserved_tags, + tags->nr_batch_move, tags->nr_max_cache); + + page += sprintf(page, "nr_free=%u, nr_reserved=%u\n", + percpu_ida_free_tags(&tags->free_tags, nr_cpu_ids), + percpu_ida_free_tags(&tags->reserved_tags, nr_cpu_ids)); + + for_each_possible_cpu(cpu) { + page += sprintf(page, " cpu%02u: nr_free=%u\n", cpu, + percpu_ida_free_tags(&tags->free_tags, cpu)); + } + + return page - orig_page; +} diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h new file mode 100644 index 000000000000..947ba2c6148e --- /dev/null +++ b/block/blk-mq-tag.h @@ -0,0 +1,27 @@ +#ifndef INT_BLK_MQ_TAG_H +#define INT_BLK_MQ_TAG_H + +struct blk_mq_tags; + +extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node); +extern void blk_mq_free_tags(struct blk_mq_tags *tags); + +extern unsigned int blk_mq_get_tag(struct blk_mq_tags *tags, gfp_t gfp, bool reserved); +extern void blk_mq_wait_for_tags(struct blk_mq_tags *tags); +extern void blk_mq_put_tag(struct blk_mq_tags *tags, unsigned int tag); +extern void blk_mq_tag_busy_iter(struct blk_mq_tags *tags, void (*fn)(void *data, unsigned long *), void *data); +extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags); +extern ssize_t blk_mq_tag_sysfs_show(struct blk_mq_tags *tags, char *page); + +enum { + BLK_MQ_TAG_CACHE_MIN = 1, + BLK_MQ_TAG_CACHE_MAX = 64, +}; + +enum { + BLK_MQ_TAG_FAIL = -1U, + BLK_MQ_TAG_MIN = BLK_MQ_TAG_CACHE_MIN, + BLK_MQ_TAG_MAX = BLK_MQ_TAG_FAIL - 1, +}; + +#endif diff --git a/block/blk-mq.c b/block/blk-mq.c new file mode 100644 index 000000000000..f21ec964e411 --- /dev/null +++ b/block/blk-mq.c @@ -0,0 +1,1480 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include "blk.h" +#include "blk-mq.h" +#include "blk-mq-tag.h" + +static DEFINE_MUTEX(all_q_mutex); +static LIST_HEAD(all_q_list); + +static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx); + +DEFINE_PER_CPU(struct llist_head, ipi_lists); + +static struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, + unsigned int cpu) +{ + return per_cpu_ptr(q->queue_ctx, cpu); +} + +/* + * This assumes per-cpu software queueing queues. They could be per-node + * as well, for instance. For now this is hardcoded as-is. Note that we don't + * care about preemption, since we know the ctx's are persistent. This does + * mean that we can't rely on ctx always matching the currently running CPU. + */ +static struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q) +{ + return __blk_mq_get_ctx(q, get_cpu()); +} + +static void blk_mq_put_ctx(struct blk_mq_ctx *ctx) +{ + put_cpu(); +} + +/* + * Check if any of the ctx's have pending work in this hardware queue + */ +static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx) +{ + unsigned int i; + + for (i = 0; i < hctx->nr_ctx_map; i++) + if (hctx->ctx_map[i]) + return true; + + return false; +} + +/* + * Mark this ctx as having pending work in this hardware queue + */ +static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *ctx) +{ + if (!test_bit(ctx->index_hw, hctx->ctx_map)) + set_bit(ctx->index_hw, hctx->ctx_map); +} + +static struct request *blk_mq_alloc_rq(struct blk_mq_hw_ctx *hctx, gfp_t gfp, + bool reserved) +{ + struct request *rq; + unsigned int tag; + + tag = blk_mq_get_tag(hctx->tags, gfp, reserved); + if (tag != BLK_MQ_TAG_FAIL) { + rq = hctx->rqs[tag]; + rq->tag = tag; + + return rq; + } + + return NULL; +} + +static int blk_mq_queue_enter(struct request_queue *q) +{ + int ret; + + __percpu_counter_add(&q->mq_usage_counter, 1, 1000000); + smp_wmb(); + /* we have problems to freeze the queue if it's initializing */ + if (!blk_queue_bypass(q) || !blk_queue_init_done(q)) + return 0; + + __percpu_counter_add(&q->mq_usage_counter, -1, 1000000); + + spin_lock_irq(q->queue_lock); + ret = wait_event_interruptible_lock_irq(q->mq_freeze_wq, + !blk_queue_bypass(q), *q->queue_lock); + /* inc usage with lock hold to avoid freeze_queue runs here */ + if (!ret) + __percpu_counter_add(&q->mq_usage_counter, 1, 1000000); + spin_unlock_irq(q->queue_lock); + + return ret; +} + +static void blk_mq_queue_exit(struct request_queue *q) +{ + __percpu_counter_add(&q->mq_usage_counter, -1, 1000000); +} + +/* + * Guarantee no request is in use, so we can change any data structure of + * the queue afterward. + */ +static void blk_mq_freeze_queue(struct request_queue *q) +{ + bool drain; + + spin_lock_irq(q->queue_lock); + drain = !q->bypass_depth++; + queue_flag_set(QUEUE_FLAG_BYPASS, q); + spin_unlock_irq(q->queue_lock); + + if (!drain) + return; + + while (true) { + s64 count; + + spin_lock_irq(q->queue_lock); + count = percpu_counter_sum(&q->mq_usage_counter); + spin_unlock_irq(q->queue_lock); + + if (count == 0) + break; + blk_mq_run_queues(q, false); + msleep(10); + } +} + +static void blk_mq_unfreeze_queue(struct request_queue *q) +{ + bool wake = false; + + spin_lock_irq(q->queue_lock); + if (!--q->bypass_depth) { + queue_flag_clear(QUEUE_FLAG_BYPASS, q); + wake = true; + } + WARN_ON_ONCE(q->bypass_depth < 0); + spin_unlock_irq(q->queue_lock); + if (wake) + wake_up_all(&q->mq_freeze_wq); +} + +bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) +{ + return blk_mq_has_free_tags(hctx->tags); +} +EXPORT_SYMBOL(blk_mq_can_queue); + +static void blk_mq_rq_ctx_init(struct blk_mq_ctx *ctx, struct request *rq, + unsigned int rw_flags) +{ + rq->mq_ctx = ctx; + rq->cmd_flags = rw_flags; + ctx->rq_dispatched[rw_is_sync(rw_flags)]++; +} + +static struct request *__blk_mq_alloc_request(struct blk_mq_hw_ctx *hctx, + gfp_t gfp, bool reserved) +{ + return blk_mq_alloc_rq(hctx, gfp, reserved); +} + +static struct request *blk_mq_alloc_request_pinned(struct request_queue *q, + int rw, gfp_t gfp, + bool reserved) +{ + struct request *rq; + + do { + struct blk_mq_ctx *ctx = blk_mq_get_ctx(q); + struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu); + + rq = __blk_mq_alloc_request(hctx, gfp & ~__GFP_WAIT, reserved); + if (rq) { + blk_mq_rq_ctx_init(ctx, rq, rw); + break; + } else if (!(gfp & __GFP_WAIT)) + break; + + blk_mq_put_ctx(ctx); + __blk_mq_run_hw_queue(hctx); + blk_mq_wait_for_tags(hctx->tags); + } while (1); + + return rq; +} + +struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp) +{ + struct request *rq; + + if (blk_mq_queue_enter(q)) + return NULL; + + rq = blk_mq_alloc_request_pinned(q, rw, gfp, false); + blk_mq_put_ctx(rq->mq_ctx); + return rq; +} + +struct request *blk_mq_alloc_reserved_request(struct request_queue *q, int rw, + gfp_t gfp) +{ + struct request *rq; + + if (blk_mq_queue_enter(q)) + return NULL; + + rq = blk_mq_alloc_request_pinned(q, rw, gfp, true); + blk_mq_put_ctx(rq->mq_ctx); + return rq; +} +EXPORT_SYMBOL(blk_mq_alloc_reserved_request); + +/* + * Re-init and set pdu, if we have it + */ +static void blk_mq_rq_init(struct blk_mq_hw_ctx *hctx, struct request *rq) +{ + blk_rq_init(hctx->queue, rq); + + if (hctx->cmd_size) + rq->special = blk_mq_rq_to_pdu(rq); +} + +static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *ctx, struct request *rq) +{ + const int tag = rq->tag; + struct request_queue *q = rq->q; + + blk_mq_rq_init(hctx, rq); + blk_mq_put_tag(hctx->tags, tag); + + blk_mq_queue_exit(q); +} + +void blk_mq_free_request(struct request *rq) +{ + struct blk_mq_ctx *ctx = rq->mq_ctx; + struct blk_mq_hw_ctx *hctx; + struct request_queue *q = rq->q; + + ctx->rq_completed[rq_is_sync(rq)]++; + + hctx = q->mq_ops->map_queue(q, ctx->cpu); + __blk_mq_free_request(hctx, ctx, rq); +} + +static void blk_mq_bio_endio(struct request *rq, struct bio *bio, int error) +{ + if (error) + clear_bit(BIO_UPTODATE, &bio->bi_flags); + else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) + error = -EIO; + + if (unlikely(rq->cmd_flags & REQ_QUIET)) + set_bit(BIO_QUIET, &bio->bi_flags); + + /* don't actually finish bio if it's part of flush sequence */ + if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) + bio_endio(bio, error); +} + +void blk_mq_complete_request(struct request *rq, int error) +{ + struct bio *bio = rq->bio; + unsigned int bytes = 0; + + trace_block_rq_complete(rq->q, rq); + + while (bio) { + struct bio *next = bio->bi_next; + + bio->bi_next = NULL; + bytes += bio->bi_size; + blk_mq_bio_endio(rq, bio, error); + bio = next; + } + + blk_account_io_completion(rq, bytes); + + if (rq->end_io) + rq->end_io(rq, error); + else + blk_mq_free_request(rq); + + blk_account_io_done(rq); +} + +void __blk_mq_end_io(struct request *rq, int error) +{ + if (!blk_mark_rq_complete(rq)) + blk_mq_complete_request(rq, error); +} + +#if defined(CONFIG_SMP) && defined(CONFIG_USE_GENERIC_SMP_HELPERS) + +/* + * Called with interrupts disabled. + */ +static void ipi_end_io(void *data) +{ + struct llist_head *list = &per_cpu(ipi_lists, smp_processor_id()); + struct llist_node *entry, *next; + struct request *rq; + + entry = llist_del_all(list); + + while (entry) { + next = entry->next; + rq = llist_entry(entry, struct request, ll_list); + __blk_mq_end_io(rq, rq->errors); + entry = next; + } +} + +static int ipi_remote_cpu(struct blk_mq_ctx *ctx, const int cpu, + struct request *rq, const int error) +{ + struct call_single_data *data = &rq->csd; + + rq->errors = error; + rq->ll_list.next = NULL; + + /* + * If the list is non-empty, an existing IPI must already + * be "in flight". If that is the case, we need not schedule + * a new one. + */ + if (llist_add(&rq->ll_list, &per_cpu(ipi_lists, ctx->cpu))) { + data->func = ipi_end_io; + data->flags = 0; + __smp_call_function_single(ctx->cpu, data, 0); + } + + return true; +} +#else /* CONFIG_SMP && CONFIG_USE_GENERIC_SMP_HELPERS */ +static int ipi_remote_cpu(struct blk_mq_ctx *ctx, const int cpu, + struct request *rq, const int error) +{ + return false; +} +#endif + +/* + * End IO on this request on a multiqueue enabled driver. We'll either do + * it directly inline, or punt to a local IPI handler on the matching + * remote CPU. + */ +void blk_mq_end_io(struct request *rq, int error) +{ + struct blk_mq_ctx *ctx = rq->mq_ctx; + int cpu; + + if (!ctx->ipi_redirect) + return __blk_mq_end_io(rq, error); + + cpu = get_cpu(); + + if (cpu == ctx->cpu || !cpu_online(ctx->cpu) || + !ipi_remote_cpu(ctx, cpu, rq, error)) + __blk_mq_end_io(rq, error); + + put_cpu(); +} +EXPORT_SYMBOL(blk_mq_end_io); + +static void blk_mq_start_request(struct request *rq) +{ + struct request_queue *q = rq->q; + + trace_block_rq_issue(q, rq); + + /* + * Just mark start time and set the started bit. Due to memory + * ordering, we know we'll see the correct deadline as long as + * REQ_ATOMIC_STARTED is seen. + */ + rq->deadline = jiffies + q->rq_timeout; + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags); +} + +static void blk_mq_requeue_request(struct request *rq) +{ + struct request_queue *q = rq->q; + + trace_block_rq_requeue(q, rq); + clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); +} + +struct blk_mq_timeout_data { + struct blk_mq_hw_ctx *hctx; + unsigned long *next; + unsigned int *next_set; +}; + +static void blk_mq_timeout_check(void *__data, unsigned long *free_tags) +{ + struct blk_mq_timeout_data *data = __data; + struct blk_mq_hw_ctx *hctx = data->hctx; + unsigned int tag; + + /* It may not be in flight yet (this is where + * the REQ_ATOMIC_STARTED flag comes in). The requests are + * statically allocated, so we know it's always safe to access the + * memory associated with a bit offset into ->rqs[]. + */ + tag = 0; + do { + struct request *rq; + + tag = find_next_zero_bit(free_tags, hctx->queue_depth, tag); + if (tag >= hctx->queue_depth) + break; + + rq = hctx->rqs[tag++]; + + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) + continue; + + blk_rq_check_expired(rq, data->next, data->next_set); + } while (1); +} + +static void blk_mq_hw_ctx_check_timeout(struct blk_mq_hw_ctx *hctx, + unsigned long *next, + unsigned int *next_set) +{ + struct blk_mq_timeout_data data = { + .hctx = hctx, + .next = next, + .next_set = next_set, + }; + + /* + * Ask the tagging code to iterate busy requests, so we can + * check them for timeout. + */ + blk_mq_tag_busy_iter(hctx->tags, blk_mq_timeout_check, &data); +} + +static void blk_mq_rq_timer(unsigned long data) +{ + struct request_queue *q = (struct request_queue *) data; + struct blk_mq_hw_ctx *hctx; + unsigned long next = 0; + int i, next_set = 0; + + queue_for_each_hw_ctx(q, hctx, i) + blk_mq_hw_ctx_check_timeout(hctx, &next, &next_set); + + if (next_set) + mod_timer(&q->timeout, round_jiffies_up(next)); +} + +/* + * Reverse check our software queue for entries that we could potentially + * merge with. Currently includes a hand-wavy stop count of 8, to not spend + * too much time checking for merges. + */ +static bool blk_mq_attempt_merge(struct request_queue *q, + struct blk_mq_ctx *ctx, struct bio *bio) +{ + struct request *rq; + int checked = 8; + + list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) { + int el_ret; + + if (!checked--) + break; + + if (!blk_rq_merge_ok(rq, bio)) + continue; + + el_ret = blk_try_merge(rq, bio); + if (el_ret == ELEVATOR_BACK_MERGE) { + if (bio_attempt_back_merge(q, rq, bio)) { + ctx->rq_merged++; + return true; + } + break; + } else if (el_ret == ELEVATOR_FRONT_MERGE) { + if (bio_attempt_front_merge(q, rq, bio)) { + ctx->rq_merged++; + return true; + } + break; + } + } + + return false; +} + +void blk_mq_add_timer(struct request *rq) +{ + __blk_add_timer(rq, NULL); +} + +/* + * Run this hardware queue, pulling any software queues mapped to it in. + * Note that this function currently has various problems around ordering + * of IO. In particular, we'd like FIFO behaviour on handling existing + * items on the hctx->dispatch list. Ignore that for now. + */ +static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + struct blk_mq_ctx *ctx; + struct request *rq; + LIST_HEAD(rq_list); + int bit, queued; + + if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->flags))) + return; + + hctx->run++; + + /* + * Touch any software queue that has pending entries. + */ + for_each_set_bit(bit, hctx->ctx_map, hctx->nr_ctx) { + clear_bit(bit, hctx->ctx_map); + ctx = hctx->ctxs[bit]; + BUG_ON(bit != ctx->index_hw); + + spin_lock(&ctx->lock); + list_splice_tail_init(&ctx->rq_list, &rq_list); + spin_unlock(&ctx->lock); + } + + /* + * If we have previous entries on our dispatch list, grab them + * and stuff them at the front for more fair dispatch. + */ + if (!list_empty_careful(&hctx->dispatch)) { + spin_lock(&hctx->lock); + if (!list_empty(&hctx->dispatch)) + list_splice_init(&hctx->dispatch, &rq_list); + spin_unlock(&hctx->lock); + } + + /* + * Delete and return all entries from our dispatch list + */ + queued = 0; + + /* + * Now process all the entries, sending them to the driver. + */ + while (!list_empty(&rq_list)) { + int ret; + + rq = list_first_entry(&rq_list, struct request, queuelist); + list_del_init(&rq->queuelist); + blk_mq_start_request(rq); + + /* + * Last request in the series. Flag it as such, this + * enables drivers to know when IO should be kicked off, + * if they don't do it on a per-request basis. + * + * Note: the flag isn't the only condition drivers + * should do kick off. If drive is busy, the last + * request might not have the bit set. + */ + if (list_empty(&rq_list)) + rq->cmd_flags |= REQ_END; + + ret = q->mq_ops->queue_rq(hctx, rq); + switch (ret) { + case BLK_MQ_RQ_QUEUE_OK: + queued++; + continue; + case BLK_MQ_RQ_QUEUE_BUSY: + /* + * FIXME: we should have a mechanism to stop the queue + * like blk_stop_queue, otherwise we will waste cpu + * time + */ + list_add(&rq->queuelist, &rq_list); + blk_mq_requeue_request(rq); + break; + default: + pr_err("blk-mq: bad return on queue: %d\n", ret); + rq->errors = -EIO; + case BLK_MQ_RQ_QUEUE_ERROR: + blk_mq_end_io(rq, rq->errors); + break; + } + + if (ret == BLK_MQ_RQ_QUEUE_BUSY) + break; + } + + if (!queued) + hctx->dispatched[0]++; + else if (queued < (1 << (BLK_MQ_MAX_DISPATCH_ORDER - 1))) + hctx->dispatched[ilog2(queued) + 1]++; + + /* + * Any items that need requeuing? Stuff them into hctx->dispatch, + * that is where we will continue on next queue run. + */ + if (!list_empty(&rq_list)) { + spin_lock(&hctx->lock); + list_splice(&rq_list, &hctx->dispatch); + spin_unlock(&hctx->lock); + } +} + +void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) +{ + if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->flags))) + return; + + if (!async) + __blk_mq_run_hw_queue(hctx); + else { + struct request_queue *q = hctx->queue; + + kblockd_schedule_delayed_work(q, &hctx->delayed_work, 0); + } +} + +void blk_mq_run_queues(struct request_queue *q, bool async) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + queue_for_each_hw_ctx(q, hctx, i) { + if ((!blk_mq_hctx_has_pending(hctx) && + list_empty_careful(&hctx->dispatch)) || + test_bit(BLK_MQ_S_STOPPED, &hctx->flags)) + continue; + + blk_mq_run_hw_queue(hctx, async); + } +} +EXPORT_SYMBOL(blk_mq_run_queues); + +void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx) +{ + cancel_delayed_work(&hctx->delayed_work); + set_bit(BLK_MQ_S_STOPPED, &hctx->state); +} +EXPORT_SYMBOL(blk_mq_stop_hw_queue); + +void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx) +{ + clear_bit(BLK_MQ_S_STOPPED, &hctx->state); + __blk_mq_run_hw_queue(hctx); +} +EXPORT_SYMBOL(blk_mq_start_hw_queue); + +void blk_mq_start_stopped_hw_queues(struct request_queue *q) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + queue_for_each_hw_ctx(q, hctx, i) { + if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state)) + continue; + + clear_bit(BLK_MQ_S_STOPPED, &hctx->state); + blk_mq_run_hw_queue(hctx, true); + } +} +EXPORT_SYMBOL(blk_mq_start_stopped_hw_queues); + +static void blk_mq_work_fn(struct work_struct *work) +{ + struct blk_mq_hw_ctx *hctx; + + hctx = container_of(work, struct blk_mq_hw_ctx, delayed_work.work); + __blk_mq_run_hw_queue(hctx); +} + +static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + struct blk_mq_ctx *ctx = rq->mq_ctx; + + list_add_tail(&rq->queuelist, &ctx->rq_list); + blk_mq_hctx_mark_pending(hctx, ctx); + + /* + * We do this early, to ensure we are on the right CPU. + */ + blk_mq_add_timer(rq); +} + +void blk_mq_insert_request(struct request_queue *q, struct request *rq, + bool run_queue) +{ + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx, *current_ctx; + + ctx = rq->mq_ctx; + hctx = q->mq_ops->map_queue(q, ctx->cpu); + + if (rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) { + blk_insert_flush(rq); + } else { + current_ctx = blk_mq_get_ctx(q); + + if (!cpu_online(ctx->cpu)) { + ctx = current_ctx; + hctx = q->mq_ops->map_queue(q, ctx->cpu); + rq->mq_ctx = ctx; + } + spin_lock(&ctx->lock); + __blk_mq_insert_request(hctx, rq); + spin_unlock(&ctx->lock); + + blk_mq_put_ctx(current_ctx); + } + + if (run_queue) + __blk_mq_run_hw_queue(hctx); +} +EXPORT_SYMBOL(blk_mq_insert_request); + +/* + * This is a special version of blk_mq_insert_request to bypass FLUSH request + * check. Should only be used internally. + */ +void blk_mq_run_request(struct request *rq, bool run_queue, bool async) +{ + struct request_queue *q = rq->q; + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx, *current_ctx; + + current_ctx = blk_mq_get_ctx(q); + + ctx = rq->mq_ctx; + if (!cpu_online(ctx->cpu)) { + ctx = current_ctx; + rq->mq_ctx = ctx; + } + hctx = q->mq_ops->map_queue(q, ctx->cpu); + + /* ctx->cpu might be offline */ + spin_lock(&ctx->lock); + __blk_mq_insert_request(hctx, rq); + spin_unlock(&ctx->lock); + + blk_mq_put_ctx(current_ctx); + + if (run_queue) + blk_mq_run_hw_queue(hctx, async); +} + +static void blk_mq_insert_requests(struct request_queue *q, + struct blk_mq_ctx *ctx, + struct list_head *list, + int depth, + bool from_schedule) + +{ + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *current_ctx; + + trace_block_unplug(q, depth, !from_schedule); + + current_ctx = blk_mq_get_ctx(q); + + if (!cpu_online(ctx->cpu)) + ctx = current_ctx; + hctx = q->mq_ops->map_queue(q, ctx->cpu); + + /* + * preemption doesn't flush plug list, so it's possible ctx->cpu is + * offline now + */ + spin_lock(&ctx->lock); + while (!list_empty(list)) { + struct request *rq; + + rq = list_first_entry(list, struct request, queuelist); + list_del_init(&rq->queuelist); + rq->mq_ctx = ctx; + __blk_mq_insert_request(hctx, rq); + } + spin_unlock(&ctx->lock); + + blk_mq_put_ctx(current_ctx); + + blk_mq_run_hw_queue(hctx, from_schedule); +} + +static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b) +{ + struct request *rqa = container_of(a, struct request, queuelist); + struct request *rqb = container_of(b, struct request, queuelist); + + return !(rqa->mq_ctx < rqb->mq_ctx || + (rqa->mq_ctx == rqb->mq_ctx && + blk_rq_pos(rqa) < blk_rq_pos(rqb))); +} + +void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) +{ + struct blk_mq_ctx *this_ctx; + struct request_queue *this_q; + struct request *rq; + LIST_HEAD(list); + LIST_HEAD(ctx_list); + unsigned int depth; + + list_splice_init(&plug->mq_list, &list); + + list_sort(NULL, &list, plug_ctx_cmp); + + this_q = NULL; + this_ctx = NULL; + depth = 0; + + while (!list_empty(&list)) { + rq = list_entry_rq(list.next); + list_del_init(&rq->queuelist); + BUG_ON(!rq->q); + if (rq->mq_ctx != this_ctx) { + if (this_ctx) { + blk_mq_insert_requests(this_q, this_ctx, + &ctx_list, depth, + from_schedule); + } + + this_ctx = rq->mq_ctx; + this_q = rq->q; + depth = 0; + } + + depth++; + list_add_tail(&rq->queuelist, &ctx_list); + } + + /* + * If 'this_ctx' is set, we know we have entries to complete + * on 'ctx_list'. Do those. + */ + if (this_ctx) { + blk_mq_insert_requests(this_q, this_ctx, &ctx_list, depth, + from_schedule); + } +} + +static void blk_mq_bio_to_request(struct request *rq, struct bio *bio) +{ + init_request_from_bio(rq, bio); + blk_account_io_start(rq, 1); +} + +static void blk_mq_make_request(struct request_queue *q, struct bio *bio) +{ + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx; + const int is_sync = rw_is_sync(bio->bi_rw); + const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA); + int rw = bio_data_dir(bio); + struct request *rq; + unsigned int use_plug, request_count = 0; + + /* + * If we have multiple hardware queues, just go directly to + * one of those for sync IO. + */ + use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync); + + blk_queue_bounce(q, &bio); + + if (use_plug && blk_attempt_plug_merge(q, bio, &request_count)) + return; + + if (blk_mq_queue_enter(q)) { + bio_endio(bio, -EIO); + return; + } + + ctx = blk_mq_get_ctx(q); + hctx = q->mq_ops->map_queue(q, ctx->cpu); + + trace_block_getrq(q, bio, rw); + rq = __blk_mq_alloc_request(hctx, GFP_ATOMIC, false); + if (likely(rq)) + blk_mq_rq_ctx_init(ctx, rq, rw); + else { + blk_mq_put_ctx(ctx); + trace_block_sleeprq(q, bio, rw); + rq = blk_mq_alloc_request_pinned(q, rw, __GFP_WAIT|GFP_ATOMIC, + false); + ctx = rq->mq_ctx; + hctx = q->mq_ops->map_queue(q, ctx->cpu); + } + + hctx->queued++; + + if (unlikely(is_flush_fua)) { + blk_mq_bio_to_request(rq, bio); + blk_mq_put_ctx(ctx); + blk_insert_flush(rq); + goto run_queue; + } + + /* + * A task plug currently exists. Since this is completely lockless, + * utilize that to temporarily store requests until the task is + * either done or scheduled away. + */ + if (use_plug) { + struct blk_plug *plug = current->plug; + + if (plug) { + blk_mq_bio_to_request(rq, bio); + if (list_empty(&plug->list)) + trace_block_plug(q); + else if (request_count >= BLK_MAX_REQUEST_COUNT) { + blk_flush_plug_list(plug, false); + trace_block_plug(q); + } + list_add_tail(&rq->queuelist, &plug->mq_list); + blk_mq_put_ctx(ctx); + return; + } + } + + spin_lock(&ctx->lock); + + if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) && + blk_mq_attempt_merge(q, ctx, bio)) + __blk_mq_free_request(hctx, ctx, rq); + else { + blk_mq_bio_to_request(rq, bio); + __blk_mq_insert_request(hctx, rq); + } + + spin_unlock(&ctx->lock); + blk_mq_put_ctx(ctx); + + /* + * For a SYNC request, send it to the hardware immediately. For an + * ASYNC request, just ensure that we run it later on. The latter + * allows for merging opportunities and more efficient dispatching. + */ +run_queue: + blk_mq_run_hw_queue(hctx, !is_sync || is_flush_fua); +} + +/* + * Default mapping to a software queue, since we use one per CPU. + */ +struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu) +{ + return q->queue_hw_ctx[q->mq_map[cpu]]; +} +EXPORT_SYMBOL(blk_mq_map_queue); + +struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_reg *reg, + unsigned int hctx_index) +{ + return kmalloc_node(sizeof(struct blk_mq_hw_ctx), + GFP_KERNEL | __GFP_ZERO, reg->numa_node); +} +EXPORT_SYMBOL(blk_mq_alloc_single_hw_queue); + +void blk_mq_free_single_hw_queue(struct blk_mq_hw_ctx *hctx, + unsigned int hctx_index) +{ + kfree(hctx); +} +EXPORT_SYMBOL(blk_mq_free_single_hw_queue); + +static void blk_mq_hctx_notify(void *data, unsigned long action, + unsigned int cpu) +{ + struct blk_mq_hw_ctx *hctx = data; + struct blk_mq_ctx *ctx; + LIST_HEAD(tmp); + + if (action != CPU_DEAD && action != CPU_DEAD_FROZEN) + return; + + /* + * Move ctx entries to new CPU, if this one is going away. + */ + ctx = __blk_mq_get_ctx(hctx->queue, cpu); + + spin_lock(&ctx->lock); + if (!list_empty(&ctx->rq_list)) { + list_splice_init(&ctx->rq_list, &tmp); + clear_bit(ctx->index_hw, hctx->ctx_map); + } + spin_unlock(&ctx->lock); + + if (list_empty(&tmp)) + return; + + ctx = blk_mq_get_ctx(hctx->queue); + spin_lock(&ctx->lock); + + while (!list_empty(&tmp)) { + struct request *rq; + + rq = list_first_entry(&tmp, struct request, queuelist); + rq->mq_ctx = ctx; + list_move_tail(&rq->queuelist, &ctx->rq_list); + } + + blk_mq_hctx_mark_pending(hctx, ctx); + + spin_unlock(&ctx->lock); + blk_mq_put_ctx(ctx); +} + +static void blk_mq_init_hw_commands(struct blk_mq_hw_ctx *hctx, + void (*init)(void *, struct blk_mq_hw_ctx *, + struct request *, unsigned int), + void *data) +{ + unsigned int i; + + for (i = 0; i < hctx->queue_depth; i++) { + struct request *rq = hctx->rqs[i]; + + init(data, hctx, rq, i); + } +} + +void blk_mq_init_commands(struct request_queue *q, + void (*init)(void *, struct blk_mq_hw_ctx *, + struct request *, unsigned int), + void *data) +{ + struct blk_mq_hw_ctx *hctx; + unsigned int i; + + queue_for_each_hw_ctx(q, hctx, i) + blk_mq_init_hw_commands(hctx, init, data); +} +EXPORT_SYMBOL(blk_mq_init_commands); + +static void blk_mq_free_rq_map(struct blk_mq_hw_ctx *hctx) +{ + struct page *page; + + while (!list_empty(&hctx->page_list)) { + page = list_first_entry(&hctx->page_list, struct page, list); + list_del_init(&page->list); + __free_pages(page, page->private); + } + + kfree(hctx->rqs); + + if (hctx->tags) + blk_mq_free_tags(hctx->tags); +} + +static size_t order_to_size(unsigned int order) +{ + size_t ret = PAGE_SIZE; + + while (order--) + ret *= 2; + + return ret; +} + +static int blk_mq_init_rq_map(struct blk_mq_hw_ctx *hctx, + unsigned int reserved_tags, int node) +{ + unsigned int i, j, entries_per_page, max_order = 4; + size_t rq_size, left; + + INIT_LIST_HEAD(&hctx->page_list); + + hctx->rqs = kmalloc_node(hctx->queue_depth * sizeof(struct request *), + GFP_KERNEL, node); + if (!hctx->rqs) + return -ENOMEM; + + /* + * rq_size is the size of the request plus driver payload, rounded + * to the cacheline size + */ + rq_size = round_up(sizeof(struct request) + hctx->cmd_size, + cache_line_size()); + left = rq_size * hctx->queue_depth; + + for (i = 0; i < hctx->queue_depth;) { + int this_order = max_order; + struct page *page; + int to_do; + void *p; + + while (left < order_to_size(this_order - 1) && this_order) + this_order--; + + do { + page = alloc_pages_node(node, GFP_KERNEL, this_order); + if (page) + break; + if (!this_order--) + break; + if (order_to_size(this_order) < rq_size) + break; + } while (1); + + if (!page) + break; + + page->private = this_order; + list_add_tail(&page->list, &hctx->page_list); + + p = page_address(page); + entries_per_page = order_to_size(this_order) / rq_size; + to_do = min(entries_per_page, hctx->queue_depth - i); + left -= to_do * rq_size; + for (j = 0; j < to_do; j++) { + hctx->rqs[i] = p; + blk_mq_rq_init(hctx, hctx->rqs[i]); + p += rq_size; + i++; + } + } + + if (i < (reserved_tags + BLK_MQ_TAG_MIN)) + goto err_rq_map; + else if (i != hctx->queue_depth) { + hctx->queue_depth = i; + pr_warn("%s: queue depth set to %u because of low memory\n", + __func__, i); + } + + hctx->tags = blk_mq_init_tags(hctx->queue_depth, reserved_tags, node); + if (!hctx->tags) { +err_rq_map: + blk_mq_free_rq_map(hctx); + return -ENOMEM; + } + + return 0; +} + +static int blk_mq_init_hw_queues(struct request_queue *q, + struct blk_mq_reg *reg, void *driver_data) +{ + struct blk_mq_hw_ctx *hctx; + unsigned int i, j; + + /* + * Initialize hardware queues + */ + queue_for_each_hw_ctx(q, hctx, i) { + unsigned int num_maps; + int node; + + node = hctx->numa_node; + if (node == NUMA_NO_NODE) + node = hctx->numa_node = reg->numa_node; + + INIT_DELAYED_WORK(&hctx->delayed_work, blk_mq_work_fn); + spin_lock_init(&hctx->lock); + INIT_LIST_HEAD(&hctx->dispatch); + hctx->queue = q; + hctx->queue_num = i; + hctx->flags = reg->flags; + hctx->queue_depth = reg->queue_depth; + hctx->cmd_size = reg->cmd_size; + + blk_mq_init_cpu_notifier(&hctx->cpu_notifier, + blk_mq_hctx_notify, hctx); + blk_mq_register_cpu_notifier(&hctx->cpu_notifier); + + if (blk_mq_init_rq_map(hctx, reg->reserved_tags, node)) + break; + + /* + * Allocate space for all possible cpus to avoid allocation in + * runtime + */ + hctx->ctxs = kmalloc_node(nr_cpu_ids * sizeof(void *), + GFP_KERNEL, node); + if (!hctx->ctxs) + break; + + num_maps = ALIGN(nr_cpu_ids, BITS_PER_LONG) / BITS_PER_LONG; + hctx->ctx_map = kzalloc_node(num_maps * sizeof(unsigned long), + GFP_KERNEL, node); + if (!hctx->ctx_map) + break; + + hctx->nr_ctx_map = num_maps; + hctx->nr_ctx = 0; + + if (reg->ops->init_hctx && + reg->ops->init_hctx(hctx, driver_data, i)) + break; + } + + if (i == q->nr_hw_queues) + return 0; + + /* + * Init failed + */ + queue_for_each_hw_ctx(q, hctx, j) { + if (i == j) + break; + + if (reg->ops->exit_hctx) + reg->ops->exit_hctx(hctx, j); + + blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier); + blk_mq_free_rq_map(hctx); + kfree(hctx->ctxs); + } + + return 1; +} + +static void blk_mq_init_cpu_queues(struct request_queue *q, + unsigned int nr_hw_queues) +{ + unsigned int i; + + for_each_possible_cpu(i) { + struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i); + struct blk_mq_hw_ctx *hctx; + + memset(__ctx, 0, sizeof(*__ctx)); + __ctx->cpu = i; + spin_lock_init(&__ctx->lock); + INIT_LIST_HEAD(&__ctx->rq_list); + __ctx->queue = q; + + /* If the cpu isn't online, the cpu is mapped to first hctx */ + hctx = q->mq_ops->map_queue(q, i); + hctx->nr_ctx++; + + if (!cpu_online(i)) + continue; + + /* + * Set local node, IFF we have more than one hw queue. If + * not, we remain on the home node of the device + */ + if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE) + hctx->numa_node = cpu_to_node(i); + } +} + +static void blk_mq_map_swqueue(struct request_queue *q) +{ + unsigned int i; + struct blk_mq_hw_ctx *hctx; + struct blk_mq_ctx *ctx; + + queue_for_each_hw_ctx(q, hctx, i) { + hctx->nr_ctx = 0; + } + + /* + * Map software to hardware queues + */ + queue_for_each_ctx(q, ctx, i) { + /* If the cpu isn't online, the cpu is mapped to first hctx */ + hctx = q->mq_ops->map_queue(q, i); + ctx->index_hw = hctx->nr_ctx; + hctx->ctxs[hctx->nr_ctx++] = ctx; + } +} + +struct request_queue *blk_mq_init_queue(struct blk_mq_reg *reg, + void *driver_data) +{ + struct blk_mq_hw_ctx **hctxs; + struct blk_mq_ctx *ctx; + struct request_queue *q; + int i; + + if (!reg->nr_hw_queues || + !reg->ops->queue_rq || !reg->ops->map_queue || + !reg->ops->alloc_hctx || !reg->ops->free_hctx) + return ERR_PTR(-EINVAL); + + if (!reg->queue_depth) + reg->queue_depth = BLK_MQ_MAX_DEPTH; + else if (reg->queue_depth > BLK_MQ_MAX_DEPTH) { + pr_err("blk-mq: queuedepth too large (%u)\n", reg->queue_depth); + reg->queue_depth = BLK_MQ_MAX_DEPTH; + } + + if (reg->queue_depth < (reg->reserved_tags + BLK_MQ_TAG_MIN)) + return ERR_PTR(-EINVAL); + + ctx = alloc_percpu(struct blk_mq_ctx); + if (!ctx) + return ERR_PTR(-ENOMEM); + + hctxs = kmalloc_node(reg->nr_hw_queues * sizeof(*hctxs), GFP_KERNEL, + reg->numa_node); + + if (!hctxs) + goto err_percpu; + + for (i = 0; i < reg->nr_hw_queues; i++) { + hctxs[i] = reg->ops->alloc_hctx(reg, i); + if (!hctxs[i]) + goto err_hctxs; + + hctxs[i]->numa_node = NUMA_NO_NODE; + hctxs[i]->queue_num = i; + } + + q = blk_alloc_queue_node(GFP_KERNEL, reg->numa_node); + if (!q) + goto err_hctxs; + + q->mq_map = blk_mq_make_queue_map(reg); + if (!q->mq_map) + goto err_map; + + setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q); + blk_queue_rq_timeout(q, 30000); + + q->nr_queues = nr_cpu_ids; + q->nr_hw_queues = reg->nr_hw_queues; + + q->queue_ctx = ctx; + q->queue_hw_ctx = hctxs; + + q->mq_ops = reg->ops; + + blk_queue_make_request(q, blk_mq_make_request); + blk_queue_rq_timed_out(q, reg->ops->timeout); + if (reg->timeout) + blk_queue_rq_timeout(q, reg->timeout); + + blk_mq_init_flush(q); + blk_mq_init_cpu_queues(q, reg->nr_hw_queues); + + if (blk_mq_init_hw_queues(q, reg, driver_data)) + goto err_hw; + + blk_mq_map_swqueue(q); + + mutex_lock(&all_q_mutex); + list_add_tail(&q->all_q_node, &all_q_list); + mutex_unlock(&all_q_mutex); + + return q; +err_hw: + kfree(q->mq_map); +err_map: + blk_cleanup_queue(q); +err_hctxs: + for (i = 0; i < reg->nr_hw_queues; i++) { + if (!hctxs[i]) + break; + reg->ops->free_hctx(hctxs[i], i); + } + kfree(hctxs); +err_percpu: + free_percpu(ctx); + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(blk_mq_init_queue); + +void blk_mq_free_queue(struct request_queue *q) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + queue_for_each_hw_ctx(q, hctx, i) { + cancel_delayed_work_sync(&hctx->delayed_work); + kfree(hctx->ctx_map); + kfree(hctx->ctxs); + blk_mq_free_rq_map(hctx); + blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier); + if (q->mq_ops->exit_hctx) + q->mq_ops->exit_hctx(hctx, i); + q->mq_ops->free_hctx(hctx, i); + } + + free_percpu(q->queue_ctx); + kfree(q->queue_hw_ctx); + kfree(q->mq_map); + + q->queue_ctx = NULL; + q->queue_hw_ctx = NULL; + q->mq_map = NULL; + + mutex_lock(&all_q_mutex); + list_del_init(&q->all_q_node); + mutex_unlock(&all_q_mutex); +} +EXPORT_SYMBOL(blk_mq_free_queue); + +/* Basically redo blk_mq_init_queue with queue frozen */ +static void __cpuinit blk_mq_queue_reinit(struct request_queue *q) +{ + blk_mq_freeze_queue(q); + + blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues); + + /* + * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe + * we should change hctx numa_node according to new topology (this + * involves free and re-allocate memory, worthy doing?) + */ + + blk_mq_map_swqueue(q); + + blk_mq_unfreeze_queue(q); +} + +static int __cpuinit blk_mq_queue_reinit_notify(struct notifier_block *nb, + unsigned long action, void *hcpu) +{ + struct request_queue *q; + + /* + * Before new mapping is established, hotadded cpu might already start + * handling requests. This doesn't break anything as we map offline + * CPUs to first hardware queue. We will re-init queue below to get + * optimal settings. + */ + if (action != CPU_DEAD && action != CPU_DEAD_FROZEN && + action != CPU_ONLINE && action != CPU_ONLINE_FROZEN) + return NOTIFY_OK; + + mutex_lock(&all_q_mutex); + list_for_each_entry(q, &all_q_list, all_q_node) + blk_mq_queue_reinit(q); + mutex_unlock(&all_q_mutex); + return NOTIFY_OK; +} + +static int __init blk_mq_init(void) +{ + unsigned int i; + + for_each_possible_cpu(i) + init_llist_head(&per_cpu(ipi_lists, i)); + + blk_mq_cpu_init(); + + /* Must be called after percpu_counter_hotcpu_callback() */ + hotcpu_notifier(blk_mq_queue_reinit_notify, -10); + + return 0; +} +subsys_initcall(blk_mq_init); diff --git a/block/blk-mq.h b/block/blk-mq.h new file mode 100644 index 000000000000..52bf1f96a2c2 --- /dev/null +++ b/block/blk-mq.h @@ -0,0 +1,52 @@ +#ifndef INT_BLK_MQ_H +#define INT_BLK_MQ_H + +struct blk_mq_ctx { + struct { + spinlock_t lock; + struct list_head rq_list; + } ____cacheline_aligned_in_smp; + + unsigned int cpu; + unsigned int index_hw; + unsigned int ipi_redirect; + + /* incremented at dispatch time */ + unsigned long rq_dispatched[2]; + unsigned long rq_merged; + + /* incremented at completion time */ + unsigned long ____cacheline_aligned_in_smp rq_completed[2]; + + struct request_queue *queue; + struct kobject kobj; +}; + +void __blk_mq_end_io(struct request *rq, int error); +void blk_mq_complete_request(struct request *rq, int error); +void blk_mq_run_request(struct request *rq, bool run_queue, bool async); +void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); +void blk_mq_init_flush(struct request_queue *q); + +/* + * CPU hotplug helpers + */ +struct blk_mq_cpu_notifier; +void blk_mq_init_cpu_notifier(struct blk_mq_cpu_notifier *notifier, + void (*fn)(void *, unsigned long, unsigned int), + void *data); +void blk_mq_register_cpu_notifier(struct blk_mq_cpu_notifier *notifier); +void blk_mq_unregister_cpu_notifier(struct blk_mq_cpu_notifier *notifier); +void blk_mq_cpu_init(void); +DECLARE_PER_CPU(struct llist_head, ipi_lists); + +/* + * CPU -> queue mappings + */ +struct blk_mq_reg; +extern unsigned int *blk_mq_make_queue_map(struct blk_mq_reg *reg); +extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues); + +void blk_mq_add_timer(struct request *rq); + +#endif diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 3aa5b195f4dd..4f8c4d90ec73 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -7,6 +7,7 @@ #include #include #include +#include #include "blk.h" #include "blk-cgroup.h" @@ -542,6 +543,11 @@ static void blk_release_queue(struct kobject *kobj) if (q->queue_tags) __blk_queue_free_tags(q); + percpu_counter_destroy(&q->mq_usage_counter); + + if (q->mq_ops) + blk_mq_free_queue(q); + blk_trace_shutdown(q); bdi_destroy(&q->backing_dev_info); @@ -575,6 +581,7 @@ int blk_register_queue(struct gendisk *disk) * bypass from queue allocation. */ blk_queue_bypass_end(q); + queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q); ret = blk_trace_init_sysfs(dev); if (ret) @@ -588,6 +595,9 @@ int blk_register_queue(struct gendisk *disk) kobject_uevent(&q->kobj, KOBJ_ADD); + if (q->mq_ops) + blk_mq_register_disk(disk); + if (!q->request_fn) return 0; @@ -610,6 +620,9 @@ void blk_unregister_queue(struct gendisk *disk) if (WARN_ON(!q)) return; + if (q->mq_ops) + blk_mq_unregister_disk(disk); + if (q->request_fn) elv_unregister_queue(q); diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 65f103563969..22846cf3595a 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -7,6 +7,7 @@ #include #include "blk.h" +#include "blk-mq.h" #ifdef CONFIG_FAIL_IO_TIMEOUT @@ -88,11 +89,18 @@ static void blk_rq_timed_out(struct request *req) ret = q->rq_timed_out_fn(req); switch (ret) { case BLK_EH_HANDLED: - __blk_complete_request(req); + /* Can we use req->errors here? */ + if (q->mq_ops) + blk_mq_complete_request(req, req->errors); + else + __blk_complete_request(req); break; case BLK_EH_RESET_TIMER: blk_clear_rq_complete(req); - blk_add_timer(req); + if (q->mq_ops) + blk_mq_add_timer(req); + else + blk_add_timer(req); break; case BLK_EH_NOT_HANDLED: /* @@ -108,6 +116,23 @@ static void blk_rq_timed_out(struct request *req) } } +void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout, + unsigned int *next_set) +{ + if (time_after_eq(jiffies, rq->deadline)) { + list_del_init(&rq->timeout_list); + + /* + * Check if we raced with end io completion + */ + if (!blk_mark_rq_complete(rq)) + blk_rq_timed_out(rq); + } else if (!*next_set || time_after(*next_timeout, rq->deadline)) { + *next_timeout = rq->deadline; + *next_set = 1; + } +} + void blk_rq_timed_out_timer(unsigned long data) { struct request_queue *q = (struct request_queue *) data; @@ -117,21 +142,8 @@ void blk_rq_timed_out_timer(unsigned long data) spin_lock_irqsave(q->queue_lock, flags); - list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list) { - if (time_after_eq(jiffies, rq->deadline)) { - list_del_init(&rq->timeout_list); - - /* - * Check if we raced with end io completion - */ - if (blk_mark_rq_complete(rq)) - continue; - blk_rq_timed_out(rq); - } else if (!next_set || time_after(next, rq->deadline)) { - next = rq->deadline; - next_set = 1; - } - } + list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list) + blk_rq_check_expired(rq, &next, &next_set); if (next_set) mod_timer(&q->timeout, round_jiffies_up(next)); @@ -157,15 +169,7 @@ void blk_abort_request(struct request *req) } EXPORT_SYMBOL_GPL(blk_abort_request); -/** - * blk_add_timer - Start timeout timer for a single request - * @req: request that is about to start running. - * - * Notes: - * Each request has its own timer, and as it is added to the queue, we - * set up the timer. When the request completes, we cancel the timer. - */ -void blk_add_timer(struct request *req) +void __blk_add_timer(struct request *req, struct list_head *timeout_list) { struct request_queue *q = req->q; unsigned long expiry; @@ -184,7 +188,8 @@ void blk_add_timer(struct request *req) req->timeout = q->rq_timeout; req->deadline = jiffies + req->timeout; - list_add_tail(&req->timeout_list, &q->timeout_list); + if (timeout_list) + list_add_tail(&req->timeout_list, timeout_list); /* * If the timer isn't already pending or this timeout is earlier @@ -196,5 +201,19 @@ void blk_add_timer(struct request *req) if (!timer_pending(&q->timeout) || time_before(expiry, q->timeout.expires)) mod_timer(&q->timeout, expiry); + +} + +/** + * blk_add_timer - Start timeout timer for a single request + * @req: request that is about to start running. + * + * Notes: + * Each request has its own timer, and as it is added to the queue, we + * set up the timer. When the request completes, we cancel the timer. + */ +void blk_add_timer(struct request *req) +{ + __blk_add_timer(req, &req->q->timeout_list); } diff --git a/block/blk.h b/block/blk.h index e837b8f619b7..c90e1d8f7a2b 100644 --- a/block/blk.h +++ b/block/blk.h @@ -10,6 +10,7 @@ #define BLK_BATCH_REQ 32 extern struct kmem_cache *blk_requestq_cachep; +extern struct kmem_cache *request_cachep; extern struct kobj_type blk_queue_ktype; extern struct ida blk_queue_ida; @@ -34,14 +35,30 @@ bool __blk_end_bidi_request(struct request *rq, int error, unsigned int nr_bytes, unsigned int bidi_bytes); void blk_rq_timed_out_timer(unsigned long data); +void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout, + unsigned int *next_set); +void __blk_add_timer(struct request *req, struct list_head *timeout_list); void blk_delete_timer(struct request *); void blk_add_timer(struct request *); + +bool bio_attempt_front_merge(struct request_queue *q, struct request *req, + struct bio *bio); +bool bio_attempt_back_merge(struct request_queue *q, struct request *req, + struct bio *bio); +bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, + unsigned int *request_count); + +void blk_account_io_start(struct request *req, bool new_io); +void blk_account_io_completion(struct request *req, unsigned int bytes); +void blk_account_io_done(struct request *req); + /* * Internal atomic flags for request handling */ enum rq_atomic_flags { REQ_ATOM_COMPLETE = 0, + REQ_ATOM_STARTED, }; /* -- cgit v1.2.3 From 280d45f6c35d8d7a0fe20c36caf426e3ac139cf9 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 25 Oct 2013 14:45:58 +0100 Subject: blk-mq: add blk_mq_stop_hw_queues Add a helper to iterate over all hw queues and stop them. This is useful for driver that implement PM suspend functionality. Signed-off-by: Christoph Hellwig Modified to just call blk_mq_stop_hw_queue() by Jens. Signed-off-by: Jens Axboe --- block/blk-mq.c | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'block') diff --git a/block/blk-mq.c b/block/blk-mq.c index f21ec964e411..ac804c635040 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -672,6 +672,16 @@ void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx) } EXPORT_SYMBOL(blk_mq_stop_hw_queue); +void blk_mq_stop_hw_queues(struct request_queue *q) +{ + struct blk_mq_hw_ctx *hctx; + int i; + + queue_for_each_hw_ctx(q, hctx, i) + blk_mq_stop_hw_queue(hctx); +} +EXPORT_SYMBOL(blk_mq_stop_hw_queues); + void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx) { clear_bit(BLK_MQ_S_STOPPED, &hctx->state); -- cgit v1.2.3 From 3228f48be2d19b2dd90db96ec16a40187a2946f3 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 28 Oct 2013 13:33:58 -0600 Subject: blk-mq: fix for flush deadlock The flush state machine takes in a struct request, which then is submitted multiple times to the underling driver. The old block code requeses the same request for each of those, so it does not have an issue with tapping into the request pool. The new one on the other hand allocates a new request for each of the actualy steps of the flush sequence. If have already allocated all of the tags for IO, we will fail allocating the flush request. Set aside a reserved request just for flushes. Signed-off-by: Jens Axboe --- block/blk-core.c | 2 +- block/blk-flush.c | 2 +- block/blk-mq.c | 14 ++++++++++++-- 3 files changed, 14 insertions(+), 4 deletions(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 3bb9e9f7f87e..9677c6525ed8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1102,7 +1102,7 @@ static struct request *blk_old_get_request(struct request_queue *q, int rw, struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask) { if (q->mq_ops) - return blk_mq_alloc_request(q, rw, gfp_mask); + return blk_mq_alloc_request(q, rw, gfp_mask, false); else return blk_old_get_request(q, rw, gfp_mask); } diff --git a/block/blk-flush.c b/block/blk-flush.c index 3e4cc9c7890a..331e627301ea 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -286,7 +286,7 @@ static void mq_flush_work(struct work_struct *work) /* We don't need set REQ_FLUSH_SEQ, it's for consistency */ rq = blk_mq_alloc_request(q, WRITE_FLUSH|REQ_FLUSH_SEQ, - __GFP_WAIT|GFP_ATOMIC); + __GFP_WAIT|GFP_ATOMIC, true); rq->cmd_type = REQ_TYPE_FS; rq->end_io = flush_end_io; diff --git a/block/blk-mq.c b/block/blk-mq.c index ac804c635040..2dc8de86d0d2 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -210,14 +210,15 @@ static struct request *blk_mq_alloc_request_pinned(struct request_queue *q, return rq; } -struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp) +struct request *blk_mq_alloc_request(struct request_queue *q, int rw, + gfp_t gfp, bool reserved) { struct request *rq; if (blk_mq_queue_enter(q)) return NULL; - rq = blk_mq_alloc_request_pinned(q, rw, gfp, false); + rq = blk_mq_alloc_request_pinned(q, rw, gfp, reserved); blk_mq_put_ctx(rq->mq_ctx); return rq; } @@ -1327,6 +1328,15 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_reg *reg, reg->queue_depth = BLK_MQ_MAX_DEPTH; } + /* + * Set aside a tag for flush requests. It will only be used while + * another flush request is in progress but outside the driver. + * + * TODO: only allocate if flushes are supported + */ + reg->queue_depth++; + reg->reserved_tags++; + if (reg->queue_depth < (reg->reserved_tags + BLK_MQ_TAG_MIN)) return ERR_PTR(-EINVAL); -- cgit v1.2.3 From 92f399c72af2d8cbb9d4f60e11d0d67ca738147f Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Tue, 29 Oct 2013 12:01:03 -0600 Subject: blk-mq: mq plug list breakage We switched to plug mq_list for mq, but some code are still using old list. Signed-off-by: Shaohua Li Signed-off-by: Jens Axboe --- block/blk-core.c | 8 +++++++- block/blk-mq.c | 2 +- 2 files changed, 8 insertions(+), 2 deletions(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 9677c6525ed8..936876ab662d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1401,13 +1401,19 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, struct blk_plug *plug; struct request *rq; bool ret = false; + struct list_head *plug_list; plug = current->plug; if (!plug) goto out; *request_count = 0; - list_for_each_entry_reverse(rq, &plug->list, queuelist) { + if (q->mq_ops) + plug_list = &plug->mq_list; + else + plug_list = &plug->list; + + list_for_each_entry_reverse(rq, plug_list, queuelist) { int el_ret; if (rq->q == q) diff --git a/block/blk-mq.c b/block/blk-mq.c index 2dc8de86d0d2..88d4e864d4c0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -950,7 +950,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio) if (plug) { blk_mq_bio_to_request(rq, bio); - if (list_empty(&plug->list)) + if (list_empty(&plug->mq_list)) trace_block_plug(q); else if (request_count >= BLK_MAX_REQUEST_COUNT) { blk_flush_plug_list(plug, false); -- cgit v1.2.3 From e7e245000110a7794de8f925b9edc06a9c852f80 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 29 Oct 2013 12:11:47 -0600 Subject: blk-mq: don't disallow request merges for req->special being set For blk-mq, if a driver has requested per-request payload data to carry command structures, they are stuffed into req->special. For an old style request based driver, req->special is used for the same purpose but indicates that a per-driver request structure has been prepared for the request already. So for the old style driver, we do not merge such requests. As most/all blk-mq drivers will use the payload feature, and since we have no problem merging on these, make this check dependent on whether it's a blk-mq enabled driver or not. Reported-by: Shaohua Li Signed-off-by: Jens Axboe --- block/blk-merge.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'block') diff --git a/block/blk-merge.c b/block/blk-merge.c index 5f2448253797..1ffc58977835 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -308,6 +308,17 @@ int ll_front_merge_fn(struct request_queue *q, struct request *req, return ll_new_hw_segment(q, req, bio); } +/* + * blk-mq uses req->special to carry normal driver per-request payload, it + * does not indicate a prepared command that we cannot merge with. + */ +static bool req_no_special_merge(struct request *req) +{ + struct request_queue *q = req->q; + + return !q->mq_ops && req->special; +} + static int ll_merge_requests_fn(struct request_queue *q, struct request *req, struct request *next) { @@ -319,7 +330,7 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req, * First check if the either of the requests are re-queued * requests. Can't merge them if they are. */ - if (req->special || next->special) + if (req_no_special_merge(req) || req_no_special_merge(next)) return 0; /* @@ -416,7 +427,7 @@ static int attempt_merge(struct request_queue *q, struct request *req, if (rq_data_dir(req) != rq_data_dir(next) || req->rq_disk != next->rq_disk - || next->special) + || req_no_special_merge(next)) return 0; if (req->cmd_flags & REQ_WRITE_SAME && @@ -515,7 +526,7 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio) return false; /* must be same device and not a special request */ - if (rq->rq_disk != bio->bi_bdev->bd_disk || rq->special) + if (rq->rq_disk != bio->bi_bdev->bd_disk || req_no_special_merge(rq)) return false; /* only merge integrity protected bio into ditto rq */ -- cgit v1.2.3 From 4912aa6c11e6a5d910264deedbec2075c6f1bb73 Mon Sep 17 00:00:00 2001 From: Jeff Moyer Date: Tue, 8 Oct 2013 14:36:41 -0400 Subject: block: fix race between request completion and timeout handling crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461 RIP: 0010:[] [] blk_requeue_request+0x94/0xa0 RSP: 0018:ffff881057eefd60 EFLAGS: 00010012 RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8 RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780 RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338 R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000 FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540) Stack: 0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000 <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393 <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90 Call Trace: [] __scsi_queue_insert+0xa3/0x150 [] ? scsi_eh_ready_devs+0x5e3/0x850 [] scsi_queue_insert+0x13/0x20 [] scsi_eh_flush_done_q+0x104/0x160 [] scsi_error_handler+0x35b/0x660 [] ? scsi_error_handler+0x0/0x660 [] kthread+0x96/0xa0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 RIP [] blk_requeue_request+0x94/0xa0 RSP The RIP is this line: BUG_ON(blk_queued_rq(rq)); After digging through the code, I think there may be a race between the request completion and the timer handler running. A timer is started for each request put on the device's queue (see blk_start_request->blk_add_timer). If the request does not complete before the timer expires, the timer handler (blk_rq_timed_out_timer) will mark the request complete atomically: static inline int blk_mark_rq_complete(struct request *rq) { return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags); } and then call blk_rq_timed_out. The latter function will call scsi_times_out, which will return one of BLK_EH_HANDLED, BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is returned, blk_clear_rq_complete is called, and blk_add_timer is again called to simply wait longer for the request to complete. Now, if the request happens to complete while this is going on, what happens? Given that we know the completion handler will bail if it finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion handler running after that bit is cleared. So, from the above paragraph, after the call to blk_clear_rq_complete. If the completion sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom there (I haven't seen this in the cores). Next, if we get the completion before the call to list_add_tail, then the timer will eventually fire for an old req, which may either be freed or reallocated (there is evidence that this might be the case). Finally, if the completion comes in *after* the addition to the timeout list, I think it's harmless. The request will be removed from the timeout list, req_atom_complete will be set, and all will be well. This will only actually explain the coredumps *IF* the request structure was freed, reallocated *and* queued before the error handler thread had a chance to process it. That is possible, but it may make sense to keep digging for another race. I think that if this is what was happening, we would see other instances of this problem showing up as null pointer or garbage pointer dereferences, for example when the request structure was not re-used. It looks like we actually do run into that situation in other reports. This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); from blk_add_timer to the only caller that could trip over it (blk_start_request). It then inverts the calls to blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address the race. I've boot tested this patch, but nothing more. Signed-off-by: Jeff Moyer Acked-by: Hannes Reinecke Cc: stable@kernel.org Signed-off-by: Jens Axboe --- block/blk-core.c | 1 + block/blk-timeout.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 0a00e4ecf87c..5e00b5a58f6a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2227,6 +2227,7 @@ void blk_start_request(struct request *req) if (unlikely(blk_bidi_rq(req))) req->next_rq->resid_len = blk_rq_bytes(req->next_rq); + BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); blk_add_timer(req); } EXPORT_SYMBOL(blk_start_request); diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 65f103563969..655ba909cd6a 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -91,8 +91,8 @@ static void blk_rq_timed_out(struct request *req) __blk_complete_request(req); break; case BLK_EH_RESET_TIMER: - blk_clear_rq_complete(req); blk_add_timer(req); + blk_clear_rq_complete(req); break; case BLK_EH_NOT_HANDLED: /* @@ -174,7 +174,6 @@ void blk_add_timer(struct request *req) return; BUG_ON(!list_empty(&req->timeout_list)); - BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags)); /* * Some LLDs, like scsi, peek at the timeout to prevent a -- cgit v1.2.3 From fff4996b7db7955414ac74386efa5e07fd766b50 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Mon, 14 Oct 2013 12:11:36 -0400 Subject: blk-core: Fix memory corruption if blkcg_init_queue fails If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy to clean up structures allocated by the backing dev. ------------[ cut here ]------------ WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0() ODEBUG: free active (active state 0) object type: percpu_counter hint: (null) Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix CPU: 0 PID: 2739 Comm: lvchange Tainted: G W 3.10.15-devel #14 Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009 0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20 ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8 ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67 Call Trace: [] dump_stack+0x19/0x1b [] warn_slowpath_common+0x6b/0xa0 [] warn_slowpath_fmt+0x47/0x50 [] ? debug_check_no_obj_freed+0xcf/0x250 [] debug_print_object+0x85/0xa0 [] debug_check_no_obj_freed+0x203/0x250 [] kmem_cache_free+0x20c/0x3a0 [] blk_alloc_queue_node+0x2a9/0x2c0 [] blk_alloc_queue+0xe/0x10 [] dm_create+0x1a3/0x530 [dm_mod] [] ? list_version_get_info+0xe0/0xe0 [dm_mod] [] dev_create+0x57/0x2b0 [dm_mod] [] ? list_version_get_info+0xe0/0xe0 [dm_mod] [] ? list_version_get_info+0xe0/0xe0 [dm_mod] [] ctl_ioctl+0x268/0x500 [dm_mod] [] ? get_lock_stats+0x22/0x70 [] dm_ctl_ioctl+0xe/0x20 [dm_mod] [] do_vfs_ioctl+0x2ed/0x520 [] ? fget_light+0x377/0x4e0 [] SyS_ioctl+0x4b/0x90 [] system_call_fastpath+0x1a/0x1f ---[ end trace 4b5ff0d55673d986 ]--- ------------[ cut here ]------------ This fix should be backported to stable kernels starting with 2.6.37. Note that in the kernels prior to 3.5 the affected code is different, but the bug is still there - bdi_init is called and bdi_destroy isn't. Signed-off-by: Mikulas Patocka Acked-by: Tejun Heo Cc: stable@kernel.org # 2.6.37+ Signed-off-by: Jens Axboe --- block/blk-core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 5e00b5a58f6a..0c611d89d748 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -645,10 +645,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) __set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags); if (blkcg_init_queue(q)) - goto fail_id; + goto fail_bdi; return q; +fail_bdi: + bdi_destroy(&q->backing_dev_info); fail_id: ida_simple_remove(&blk_queue_ida, q->id); fail_q: -- cgit v1.2.3 From 170d800af83f3ab2b5ced0e370a861e023dee22a Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Tue, 15 Oct 2013 12:22:29 -0600 Subject: block: Replace __get_cpu_var uses __get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too. The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non -x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to this_cpu_inc(y) Signed-off-by: Christoph Lameter Signed-off-by: Jens Axboe --- block/blk-iopoll.c | 6 +++--- block/blk-softirq.c | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'block') diff --git a/block/blk-iopoll.c b/block/blk-iopoll.c index 4b8d9b541112..1855bf51edb0 100644 --- a/block/blk-iopoll.c +++ b/block/blk-iopoll.c @@ -35,7 +35,7 @@ void blk_iopoll_sched(struct blk_iopoll *iop) unsigned long flags; local_irq_save(flags); - list_add_tail(&iop->list, &__get_cpu_var(blk_cpu_iopoll)); + list_add_tail(&iop->list, this_cpu_ptr(&blk_cpu_iopoll)); __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ); local_irq_restore(flags); } @@ -79,7 +79,7 @@ EXPORT_SYMBOL(blk_iopoll_complete); static void blk_iopoll_softirq(struct softirq_action *h) { - struct list_head *list = &__get_cpu_var(blk_cpu_iopoll); + struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll); int rearm = 0, budget = blk_iopoll_budget; unsigned long start_time = jiffies; @@ -201,7 +201,7 @@ static int blk_iopoll_cpu_notify(struct notifier_block *self, local_irq_disable(); list_splice_init(&per_cpu(blk_cpu_iopoll, cpu), - &__get_cpu_var(blk_cpu_iopoll)); + this_cpu_ptr(&blk_cpu_iopoll)); __raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ); local_irq_enable(); } diff --git a/block/blk-softirq.c b/block/blk-softirq.c index ec9e60636f43..ce4b8bfd3d27 100644 --- a/block/blk-softirq.c +++ b/block/blk-softirq.c @@ -23,7 +23,7 @@ static void blk_done_softirq(struct softirq_action *h) struct list_head *cpu_list, local_list; local_irq_disable(); - cpu_list = &__get_cpu_var(blk_cpu_done); + cpu_list = this_cpu_ptr(&blk_cpu_done); list_replace_init(cpu_list, &local_list); local_irq_enable(); @@ -44,7 +44,7 @@ static void trigger_softirq(void *data) struct list_head *list; local_irq_save(flags); - list = &__get_cpu_var(blk_cpu_done); + list = this_cpu_ptr(&blk_cpu_done); list_add_tail(&rq->csd.list, list); if (list->next == &rq->csd.list) @@ -90,7 +90,7 @@ static int blk_cpu_notify(struct notifier_block *self, unsigned long action, local_irq_disable(); list_splice_init(&per_cpu(blk_cpu_done, cpu), - &__get_cpu_var(blk_cpu_done)); + this_cpu_ptr(&blk_cpu_done)); raise_softirq_irqoff(BLOCK_SOFTIRQ); local_irq_enable(); } @@ -135,7 +135,7 @@ void __blk_complete_request(struct request *req) if (ccpu == cpu || shared) { struct list_head *list; do_local: - list = &__get_cpu_var(blk_cpu_done); + list = this_cpu_ptr(&blk_cpu_done); list_add_tail(&req->csd.list, list); /* -- cgit v1.2.3 From eb1c160b22655fd4ec44be732d6594fd1b1e44f4 Mon Sep 17 00:00:00 2001 From: Tomoki Sekiyama Date: Tue, 15 Oct 2013 16:42:16 -0600 Subject: elevator: Fix a race in elevator switching and md device initialization The soft lockup below happens at the boot time of the system using dm multipath and the udev rules to switch scheduler. [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483] [ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50 ... [ 356.127001] Call Trace: [ 356.127001] [] try_to_del_timer_sync+0x20/0x70 [ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230 [ 356.127001] [] del_timer_sync+0x52/0x60 [ 356.127001] [] cfq_exit_queue+0x32/0xf0 [ 356.127001] [] elevator_exit+0x2f/0x50 [ 356.127001] [] elevator_change+0xf1/0x1c0 [ 356.127001] [] elv_iosched_store+0x20/0x50 [ 356.127001] [] queue_attr_store+0x59/0xb0 [ 356.127001] [] sysfs_write_file+0xc6/0x140 [ 356.127001] [] vfs_write+0xbd/0x1e0 [ 356.127001] [] SyS_write+0x49/0xa0 [ 356.127001] [] system_call_fastpath+0x16/0x1b This is caused by a race between md device initialization by multipathd and shell script to switch the scheduler using sysfs. - multipathd: SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init q->elevator = elevator_alloc(q, e); // not yet initialized - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler': elevator_switch (in the call trace above) struct elevator_queue *old = q->elevator; q->elevator = elevator_alloc(q, new_e); elevator_exit(old); // lockup! (*) - multipathd: (cont.) err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely while timer->base == NULL. In this case, as timer will never initialized, it results in lockup. This patch introduces acquisition of q->sysfs_lock around elevator_init() into blk_init_allocated_queue(), to provide mutual exclusion between initialization of the q->scheduler and switching of the scheduler. This should fix this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=902012 Signed-off-by: Tomoki Sekiyama Signed-off-by: Jens Axboe --- block/blk-core.c | 10 +++++++++- block/elevator.c | 6 ++++++ 2 files changed, 15 insertions(+), 1 deletion(-) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index 0c611d89d748..fce4b9387f36 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -741,9 +741,17 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, q->sg_reserved_size = INT_MAX; + /* Protect q->elevator from elevator_change */ + mutex_lock(&q->sysfs_lock); + /* init elevator */ - if (elevator_init(q, NULL)) + if (elevator_init(q, NULL)) { + mutex_unlock(&q->sysfs_lock); return NULL; + } + + mutex_unlock(&q->sysfs_lock); + return q; } EXPORT_SYMBOL(blk_init_allocated_queue); diff --git a/block/elevator.c b/block/elevator.c index 2bcbd8cc14d4..249cc5e10906 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -186,6 +186,12 @@ int elevator_init(struct request_queue *q, char *name) struct elevator_type *e = NULL; int err; + /* + * q->sysfs_lock must be held to provide mutual exclusion between + * elevator_switch() and here. + */ + lockdep_assert_held(&q->sysfs_lock); + if (unlikely(q->elevator)) return 0; -- cgit v1.2.3 From 7c8a3679e3d8e9d92d58f282161760a0e247df97 Mon Sep 17 00:00:00 2001 From: Tomoki Sekiyama Date: Tue, 15 Oct 2013 16:42:19 -0600 Subject: elevator: acquire q->sysfs_lock in elevator_change() Add locking of q->sysfs_lock into elevator_change() (an exported function) to ensure it is held to protect q->elevator from elevator_init(), even if elevator_change() is called from non-sysfs paths. sysfs path (elv_iosched_store) uses __elevator_change(), non-locking version, as the lock is already taken by elv_iosched_store(). Signed-off-by: Tomoki Sekiyama Signed-off-by: Jens Axboe --- block/elevator.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'block') diff --git a/block/elevator.c b/block/elevator.c index 249cc5e10906..b7ff2861b6bd 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -965,7 +965,7 @@ fail_init: /* * Switch this queue to the given IO scheduler. */ -int elevator_change(struct request_queue *q, const char *name) +static int __elevator_change(struct request_queue *q, const char *name) { char elevator_name[ELV_NAME_MAX]; struct elevator_type *e; @@ -987,6 +987,18 @@ int elevator_change(struct request_queue *q, const char *name) return elevator_switch(q, e); } + +int elevator_change(struct request_queue *q, const char *name) +{ + int ret; + + /* Protect q->elevator from elevator_init() */ + mutex_lock(&q->sysfs_lock); + ret = __elevator_change(q, name); + mutex_unlock(&q->sysfs_lock); + + return ret; +} EXPORT_SYMBOL(elevator_change); ssize_t elv_iosched_store(struct request_queue *q, const char *name, @@ -997,7 +1009,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name, if (!q->elevator) return count; - ret = elevator_change(q, name); + ret = __elevator_change(q, name); if (!ret) return count; -- cgit v1.2.3 From d82ae52e68892338068e7559a0c0657193341ce4 Mon Sep 17 00:00:00 2001 From: Mike Snitzer Date: Fri, 18 Oct 2013 09:44:49 -0600 Subject: block: properly stack underlying max_segment_size to DM device Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE (65536) even if the underlying device(s) have a larger value -- this is due to blk_stack_limits() using min_not_zero() when stacking the max_segment_size limit. 1073741824 before patch: 65536 after patch: 1073741824 Reported-by: Lukasz Flis Signed-off-by: Mike Snitzer Cc: stable@vger.kernel.org # v3.3+ Signed-off-by: Jens Axboe --- block/blk-settings.c | 1 + 1 file changed, 1 insertion(+) (limited to 'block') diff --git a/block/blk-settings.c b/block/blk-settings.c index c50ecf0ea3b1..53309333c2f0 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -144,6 +144,7 @@ void blk_set_stacking_limits(struct queue_limits *lim) lim->discard_zeroes_data = 1; lim->max_segments = USHRT_MAX; lim->max_hw_sectors = UINT_MAX; + lim->max_segment_size = UINT_MAX; lim->max_sectors = UINT_MAX; lim->max_write_same_sectors = UINT_MAX; } -- cgit v1.2.3 From 23779fbc99302dddab7f056ae47c3463169cbb64 Mon Sep 17 00:00:00 2001 From: Alireza Haghdoost Date: Wed, 23 Oct 2013 17:08:16 +0100 Subject: block: Enable sysfs nomerge control for I/O requests in the plug list This patch enables the sysfs to control I/O request merge functionality in the plug list. While this control has been implemented for the request queue, it was dismissed in the plug list. Therefore, block layer merges requests together (or attempt to merge) even if the merge capability was disable using sysfs nomerge parameter value 2. This limitation is directly affects functionality of io_submit() system call. The system call enables user to submit a bunch of IO requests from user space using struct iocb **ios input argument. However, the unconditioned merging functionality in the plug list potentially merges these requests together down the road. Therefore, there is no way to distinguish between an application sending bunch of sequential IOs and an application sending one big IO. Ultimately, all requests generated by the former app merge within the plug list together and looks similar to the second app. While the merging functionality is a desirable feature to improve the performance of IO subsystem for some applications, it is not useful for other application like ours at all. Signed-off-by: Alireza Haghdoost Reviewed-by: Jeff Moyer Coding style modified. Signed-off-by: Jens Axboe --- block/blk-core.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'block') diff --git a/block/blk-core.c b/block/blk-core.c index fce4b9387f36..25f13479f552 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1429,6 +1429,9 @@ static bool attempt_plug_merge(struct request_queue *q, struct bio *bio, struct request *rq; bool ret = false; + if (blk_queue_nomerges(q)) + goto out; + plug = current->plug; if (!plug) goto out; -- cgit v1.2.3 From e0ce0eacb3197ad6e4ae37006c73af9411f97ecc Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Wed, 7 Aug 2013 14:20:17 -0700 Subject: block: Use rw_copy_check_uvector() No need for silly open coding - and struct sg_iovec has exactly the same layout as struct iovec... Signed-off-by: Kent Overstreet Cc: Jens Axboe Signed-off-by: Jens Axboe --- block/scsi_ioctl.c | 39 ++++++++++----------------------------- 1 file changed, 10 insertions(+), 29 deletions(-) (limited to 'block') diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index a5ffcc988f0b..625e3e471d65 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -286,7 +286,8 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, struct sg_io_hdr *hdr, fmode_t mode) { unsigned long start_time; - int writing = 0, ret = 0; + ssize_t ret = 0; + int writing = 0; struct request *rq; char sense[SCSI_SENSE_BUFFERSIZE]; struct bio *bio; @@ -321,37 +322,16 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, } if (hdr->iovec_count) { - const int size = sizeof(struct sg_iovec) * hdr->iovec_count; size_t iov_data_len; - struct sg_iovec *sg_iov; struct iovec *iov; - int i; - sg_iov = kmalloc(size, GFP_KERNEL); - if (!sg_iov) { - ret = -ENOMEM; + ret = rw_copy_check_uvector(-1, hdr->dxferp, hdr->iovec_count, + 0, NULL, &iov); + if (ret < 0) goto out; - } - - if (copy_from_user(sg_iov, hdr->dxferp, size)) { - kfree(sg_iov); - ret = -EFAULT; - goto out; - } - /* - * Sum up the vecs, making sure they don't overflow - */ - iov = (struct iovec *) sg_iov; - iov_data_len = 0; - for (i = 0; i < hdr->iovec_count; i++) { - if (iov_data_len + iov[i].iov_len < iov_data_len) { - kfree(sg_iov); - ret = -EINVAL; - goto out; - } - iov_data_len += iov[i].iov_len; - } + iov_data_len = ret; + ret = 0; /* SG_IO howto says that the shorter of the two wins */ if (hdr->dxfer_len < iov_data_len) { @@ -361,9 +341,10 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, iov_data_len = hdr->dxfer_len; } - ret = blk_rq_map_user_iov(q, rq, NULL, sg_iov, hdr->iovec_count, + ret = blk_rq_map_user_iov(q, rq, NULL, (struct sg_iovec *) iov, + hdr->iovec_count, iov_data_len, GFP_KERNEL); - kfree(sg_iov); + kfree(iov); } else if (hdr->dxfer_len) ret = blk_rq_map_user(q, rq, NULL, hdr->dxferp, hdr->dxfer_len, GFP_KERNEL); -- cgit v1.2.3 From 97597dc08f58e25bc74154b7d1c387a4c0432950 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Mon, 4 Nov 2013 14:00:06 +0100 Subject: block: Do not call sector_div() with a 64-bit divisor do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions of 64-bit number by 32-bit numbers. Passing 64-bit divisor types caused issues in the past on 32-bit platforms, cfr. commit ea077b1b96e073eac5c3c5590529e964767fc5f7 ("m68k: Truncate base in do_div()"). As queue_limits.max_discard_sectors and .discard_granularity are unsigned int, max_discard_sectors and granularity should be unsigned int. As bdev_discard_alignment() returns int, alignment should be int. Now 2 calls to sector_div() can be replaced by 32-bit arithmetic: - The 64-bit modulo operation can become a 32-bit modulo operation, - The 64-bit division and multiplication can be replaced by a 32-bit modulo operation and a subtraction. Signed-off-by: Geert Uytterhoeven Signed-off-by: Jens Axboe --- block/blk-lib.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'block') diff --git a/block/blk-lib.c b/block/blk-lib.c index d6f50d572565..9b5b561cb928 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -43,8 +43,8 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, DECLARE_COMPLETION_ONSTACK(wait); struct request_queue *q = bdev_get_queue(bdev); int type = REQ_WRITE | REQ_DISCARD; - sector_t max_discard_sectors; - sector_t granularity, alignment; + unsigned int max_discard_sectors, granularity; + int alignment; struct bio_batch bb; struct bio *bio; int ret = 0; @@ -58,16 +58,14 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, /* Zero-sector (unknown) and one-sector granularities are the same. */ granularity = max(q->limits.discard_granularity >> 9, 1U); - alignment = bdev_discard_alignment(bdev) >> 9; - alignment = sector_div(alignment, granularity); + alignment = (bdev_discard_alignment(bdev) >> 9) % granularity; /* * Ensure that max_discard_sectors is of the proper * granularity, so that requests stay aligned after a split. */ max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9); - sector_div(max_discard_sectors, granularity); - max_discard_sectors *= granularity; + max_discard_sectors -= max_discard_sectors % granularity; if (unlikely(!max_discard_sectors)) { /* Avoid infinite loop below. Being cautious never hurts. */ return -EOPNOTSUPP; -- cgit v1.2.3 From 8616ebb16bcef312024b9d28719f9bf5c5c3aafb Mon Sep 17 00:00:00 2001 From: Duan Jiong Date: Wed, 6 Nov 2013 15:55:44 +0800 Subject: block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong Signed-off-by: Jens Axboe --- block/blk-timeout.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'block') diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 655ba909cd6a..abf725c655fc 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -31,7 +31,7 @@ static int __init fail_io_timeout_debugfs(void) struct dentry *dir = fault_create_debugfs_attr("fail_io_timeout", NULL, &fail_io_timeout); - return IS_ERR(dir) ? PTR_ERR(dir) : 0; + return PTR_ERR_OR_ZERO(dir); } late_initcall(fail_io_timeout_debugfs); -- cgit v1.2.3 From c7d1ba417c7cb7297d14dd47a390ec90ce548d5c Mon Sep 17 00:00:00 2001 From: Duan Jiong Date: Wed, 6 Nov 2013 15:56:39 +0800 Subject: block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong Signed-off-by: Jens Axboe --- block/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'block') diff --git a/block/ioctl.c b/block/ioctl.c index a31d91d9bc5a..7d5c3b20af45 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -64,7 +64,7 @@ static int blkpg_ioctl(struct block_device *bdev, struct blkpg_ioctl_arg __user part = add_partition(disk, partno, start, length, ADDPART_FLAG_NONE, NULL); mutex_unlock(&bdev->bd_mutex); - return IS_ERR(part) ? PTR_ERR(part) : 0; + return PTR_ERR_OR_ZERO(part); case BLKPG_DEL_PARTITION: part = disk_get_part(disk, partno); if (!part) -- cgit v1.2.3