aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-06-24jump_label/x86: Work around asm build bug on older/backported GCCsv4.1/topic/writeback-cgPeter Zijlstra
Boris reported that gcc version 4.4.4 20100503 (Red Hat 4.4.4-2) fails to build linux-next kernels that have this fresh commit via the locking tree: 11276d5306b8 ("locking/static_keys: Add a new static_key interface") The problem appears to be that even though @key and @branch are compile time constants, it doesn't see the following expression as an immediate value: &((char *)key)[branch] More recent GCCs don't appear to have this problem. In particular, Red Hat backported the 'asm goto' feature into 4.4, 'normal' 4.4 compilers will not have this feature and thus not run into this asm. The workaround is to supply both values to the asm as immediates and do the addition in asm. Suggested-by: H. Peter Anvin <hpa@zytor.com> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit d420acd816c07c7be31bd19d09cbcb16e5572fa6) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24blkcg: fix incorrect read/write sync/async stat accountingTejun Heo
While unifying how blkcg stats are collected, 77ea733884eb ("blkcg: move io_service_bytes and io_serviced stats into blkcg_gq") incorrectly used bio->flags instead of bio->rw to tell the IO type. This made IOs to be accounted as the wrong type. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 77ea733884eb ("blkcg: move io_service_bytes and io_serviced stats into blkcg_gq") Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 174fd8d369613c4e06660f3704caaba48dac8554) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24cgroup: fix seq_show_option merge with legacy_nameKees Cook
When seq_show_option (commit a068acf2ee77: "fs: create and use seq_show_option for escaping") was merged, it did not correctly collide with cgroup's addition of legacy_name (commit 3e1d2eed39d8: "cgroup: introduce cgroup_subsys->legacy_name") changes. This fixes the reported name. Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 61e57c0c3a37539e13af03ce68598034d37c7256) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24Revert "writeback: plug writeback at a high level"Linus Torvalds
This reverts commit d353d7587d02116b9732d5c06615aed75a4d3a47. Doing the block layer plug/unplug inside writeback_sb_inodes() is broken, because that function is actually called with a spinlock held: wb->list_lock, as pointed out by Chris Mason. Chris suggested just dropping and re-taking the spinlock around the blk_finish_plug() call (the plgging itself can happen under the spinlock), and that would technically work, but is just disgusting. We do something fairly similar - but not quite as disgusting because we at least have a better reason for it - in writeback_single_inode(), so it's not like the caller can depend on the lock being held over the call, but in this case there just isn't any good reason for that "release and re-take the lock" pattern. [ In general, we should really strive to avoid the "release and retake" pattern for locks, because in the general case it can easily cause subtle bugs when the caller caches any state around the call that might be invalidated by dropping the lock even just temporarily. ] But in this case, the plugging should be easy to just move up to the callers before the spinlock is taken, which should even improve the effectiveness of the plug. So there is really no good reason to play games with locking here. I'll send off a test-patch so that Dave Chinner can verify that that plug movement works. In the meantime this just reverts the problematic commit and adds a comment to the function so that we hopefully don't make this mistake again. Reported-by: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 0ba13fd19d39b7cb672bcec052bc813389c079a4) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24block: don't release bdi while request_queue has live referencesTejun Heo
bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_dstroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit b02176f30cd30acccd3b633ab7d9aed8b5da52ff) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: remove old define bdi_unregister() in mm/backing-dev.c
2016-06-24writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inodeTejun Heo
When cgroup writeback is in use, there can be multiple wb's (bdi_writeback's) per bdi and an inode may switch among them dynamically. In a couple places, the wrong wb was used leading to performing operations on the wrong list under the wrong lock corrupting the io lists. * writeback_single_inode() was taking @wb parameter and used it to remove the inode from io lists if it becomes clean after writeback. The callers of this function were always passing in the root wb regardless of the actual wb that the inode was associated with, which could also change while writeback is in progress. Fix it by dropping the @wb parameter and using inode_to_wb_and_lock_list() to determine and lock the associated wb. * After writeback_sb_inodes() writes out an inode, it re-locks @wb and inode to remove it from or move it to the right io list. It assumes that the inode is still associated with @wb; however, the inode may have switched to another wb while writeback was in progress. Fix it by using inode_to_wb_and_lock_list() to determine and lock the associated wb after writeback is complete. As the function requires the original @wb->list_lock locked for the next iteration, in the unlikely case where the inode has changed association, switch the locks. Kudos to Tahsin for pinpointing these subtle breakages. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching") Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com Reported-and-diagnosed-by: Tahsin Erdogan <tahsin@google.com> Tested-by: Tahsin Erdogan <tahsin@google.com> Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit aaf2559332ba272671bb870464a99b909b29a3a1) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()Tejun Heo
locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with the target inode, unlocks inode, locks the wb's list_lock and verifies that the inode is still associated with the wb. To prevent the wb going away between dropping inode lock and acquiring list_lock, the wb is pinned while inode lock is held. The wb reference is put right after acquiring list_lock citing that the wb won't be dereferenced anymore. This isn't true. If the inode is still associated with the wb, the inode has reference and it's safe to return the wb; however, if inode has been switched, the wb still needs to be unlocked which is a dereference and can lead to use-after-free if it it races with wb destruction. Fix it by putting the reference after releasing list_lock. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 87e1d789bf55 ("writeback: implement [locked_]inode_to_wb_and_lock_list()") Cc: stable@vger.kernel.org # v4.2+ Tested-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 614a4e3773148a31f58dc174bbf578ceb63510c2) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24writeback: use higher precision calculation in domain_dirty_limits()Tejun Heo
As vm.dirty_[background_]bytes can't be applied verbatim to multiple cgroup writeback domains, they get converted to percentages in domain_dirty_limits() and applied the same way as vm.dirty_[background]ratio. However, if the specified bytes is lower than 1% of available memory, the calculated ratios become zero and the writeback domain gets throttled constantly. Fix it by using per-PAGE_SIZE instead of percentage for ratio calculations. Also, the updated DIV_ROUND_UP() usages now should yield 1/4096 (0.0244%) as the minimum ratio as long as the specified bytes are above zero. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Miao Xie <miaoxie@huawei.com> Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com Cc: stable@vger.kernel.org # v4.2+ Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()") Reviewed-by: Jan Kara <jack@suse.cz> Adjusted comment based on Jan's suggestion. Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 62a584fe05eef1f80ed49a286a29328f1a224fb9) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24memcg: remove pcp_counter_lockGreg Thelen
Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online") removed the last use of the per memcg pcp_counter_lock but forgot to remove the variable. Kill the vestigial variable. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ef510194cefe0cd369ef73419cd65b0a5bb4fb5b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-24bdi: Remove "inline" keyword from exported I_BDEV() implementationGeert Uytterhoeven
With gcc 3.4.6/4.1.2/4.2.4 (not with 4.4.7/4.6.4/4.8.4): CC fs/block_dev.o include/linux/fs.h:804: warning: ‘I_BDEV’ declared inline after being called include/linux/fs.h:804: warning: previous declaration of ‘I_BDEV’ was here Commit a212b105b07d75b4 ("bdi: make inode_to_bdi() inline") added a caller of I_BDEV() in a header file, exposing the bogus "inline" on the exported implementation. Drop the "inline" keyword to fix this. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit ff5053f66677b046cb85695b2a93d4eb07029fa7) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-16writeback: Fix performance regression in wb_over_bg_thresh()Howard Cochran
Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations") unintentionally changed this function's meaning from "are there more dirty pages than the background writeback threshold" to "are there more dirty pages than the writeback threshold". The background writeback threshold is typically half of the writeback threshold, so this had the effect of raising the number of dirty pages required to cause a writeback worker to perform background writeout. This can cause a very severe performance regression when a BDI uses BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker can now disagree on whether writeback should be initiated. For example, in a system having 1GB of RAM, a single spinning disk, and a "pass-through" FUSE filesystem mounted over the disk, application code mmapped a 128MB file on the disk and was randomly dirtying pages in that mapping. Because FUSE uses strictlimit and has a default max_ratio of only 1%, in balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the dirty_freerun_ceiling is the average of those, ~150. So, it pauses the dirtying processes when we have 151 dirty pages and wakes up a background writeback worker. But the worker tests the wrong threshold (200 instead of 100), so it does not initiate writeback and just returns. Thus, balance_dirty_pages keeps looping, sleeping and then waking up the worker who will do nothing. It remains stuck in this state until the few dirty pages that we have finally expire and we write them back for that reason. Then the whole process repeats, resulting in near-zero throughput through the FUSE BDI. The fix is to call the parameterized variant of wb_calc_thresh, so that the worker will do writeback if the bg_thresh is exceeded which was the behavior before the referenced commit. Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations") Signed-off-by: Howard Cochran <hcochran@kernelspring.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+ Tested-by Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 74d369443325063a5f0260e63971decb950fd8fa) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentationTejun Heo
Now that cgroup v2 is almost out of the door, replace the development documentation unified-hierarchy.txt with Documentation/cgroup.txt which is a superset of unified-hierarchy.txt and authoritatively describes all userland-visible aspects of cgroup. v2: Updated to include all information from blkio-controller.txt and list filesystems which support cgroup writeback as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> (cherry picked from commit 6c2920926b10e8303378408e3c2b8952071d4344) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: pick up Documentation/cgroup-legacy/unified-hierarchy.txt
2016-06-08cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/Tejun Heo
In preparation for adding cgroup2 documentation, rename Documentation/cgroups/ to Documentation/cgroup-legacy/. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> (cherry picked from commit 0d942766453f3d23a51e0a2d430340a178b0903e) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: pick up Documentation/cgroup-legacy/pids.txt
2016-06-08cgroup: replace __DEVEL__sane_behavior with cgroup2 fs typeTejun Heo
With major controllers - cpu, memory and io - shaping up for the unified hierarchy, cgroup2 is about ready to be, gradually, released into the wild. Replace __DEVEL__sane_behavior flag which was used to select the unified hierarchy with a separate filesystem type "cgroup2" so that unified hierarchy can be mounted as follows. mount -t cgroup2 none $MOUNT_POINT The cgroup2 fs has its own magic number - 0x63677270 ("cgrp"). v2: Assign a different magic number to cgroup2 fs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> (cherry picked from commit 67e9c74b8a873408c27ac9a8e4c1d1c8d72c93ff) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cfq-iosched: fix the setting of IOPS mode on SSDsJens Axboe
A previous commit wanted to make CFQ default to IOPS mode on non-rotational storage, however it did so when the queue was initialized and the non-rotational flag is only set later on in the probe. Add an elevator hook that gets called off the add_disk() path, at that point we know that feature probing has finished, and we can reliably check for the various flags that drivers can set. Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs") Tested-by: Romain Francoise <romain@orebokech.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 0bb979472a7401022109e81dd89d777adea58710) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08block: Make CFQ default to IOPS mode on SSDsTahsin Erdogan
CFQ idling causes reduced IOPS throughput on non-rotational disks. Since disk head seeking is not applicable to SSDs, it doesn't really help performance by anticipating future near-by IO requests. By turning off idling (and switching to IOPS mode), we allow other processes to dispatch IO requests down to the driver and so increase IO throughput. Following FIO benchmark results were taken on a cloud SSD offering with idling on and off: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 7054 90.107 38.697 28217KB/s Off 29255 21.836 11.730 117022KB/s fio --name=temp --size=100G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10 And the following is from a local SSD run: Idling iops avg-lat(ms) stddev bw ------------------------------------------------------ On 19320 33.043 14.068 77281KB/s Off 21626 29.465 12.662 86507KB/s fio --name=temp --size=5G --time_based --ioengine=libaio \ --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \ --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \ --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10 Reviewed-by: Nauman Rafique <nauman@google.com> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 41c0126b3f22ef36b97b3c38b8f29569848a5ce2) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08writeback: initialize inode members that track writeback historyTahsin Erdogan
inode struct members that track cgroup writeback information should be reinitialized when inode gets allocated from kmem_cache. Otherwise, their values remain and get used by the new inode. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Tejun Heo <tj@kernel.org> Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching") Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 3d65ae4634ed8350aee98a4e6f4e41fe40c7d282) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08writeback: keep superblock pinned during cgroup writeback association switchesTejun Heo
If cgroup writeback is in use, an inode is associated with a cgroup for writeback. If the inode's main dirtier changes to another cgroup, the association gets updated asynchronously. Nothing was pinning the superblock while such switches are in progress and superblock could go away while async switching is pending or in progress leading to crashes like the following. kernel BUG at fs/jbd2/transaction.c:319! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51 Hardware name: Google Google, BIOS Google 01/01/2011 Workqueue: events inode_switch_wbs_work_fn task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000 RIP: 0010:[<ffffffff803e6922>] [<ffffffff803e6922>] start_this_handle+0x382/0x3e0 RSP: 0018:ffff880209267c30 EFLAGS: 00010202 ... Call Trace: [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190 [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70 [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0 [<ffffffff8035338b>] evict+0xbb/0x190 [<ffffffff80354190>] iput+0x130/0x190 [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0 [<ffffffff80279819>] process_one_work+0x129/0x300 [<ffffffff80279b16>] worker_thread+0x126/0x480 [<ffffffff8027ed14>] kthread+0xc4/0xe0 [<ffffffff809771df>] ret_from_fork+0x3f/0x70 Fix it by bumping s_active while cgroup association switching is in flight. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching") Cc: stable@vger.kernel.org #v4.5+ Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 5ff8eaac1636bf6deae86491f4818c4c69d1a9ac) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08memcg: fix dirty page migrationGreg Thelen
The problem starts with a file backed dirty page which is charged to a memcg. Then page migration is used to move oldpage to newpage. Migration: - copies the oldpage's data to newpage - clears oldpage.PG_dirty - sets newpage.PG_dirty - uncharges oldpage from memcg - charges newpage to memcg Clearing oldpage.PG_dirty decrements the charged memcg's dirty page count. However, because newpage is not yet charged, setting newpage.PG_dirty does not increment the memcg's dirty page count. After migration completes newpage.PG_dirty is eventually cleared, often in account_page_cleaned(). At this time newpage is charged to a memcg so the memcg's dirty page count is decremented which causes underflow because the count was not previously incremented by migration. This underflow causes balance_dirty_pages() to see a very large unsigned number of dirty memcg pages which leads to aggressive throttling of buffered writes by processes in non root memcg. This issue: - can harm performance of non root memcg buffered writes. - can report too small (even negative) values in memory.stat[(total_)dirty] counters of all memcg, including the root. To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce page_memcg() and set_page_memcg() helpers. Test: 0) setup and enter limited memcg mkdir /sys/fs/cgroup/test echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes echo $$ > /sys/fs/cgroup/test/cgroup.procs 1) buffered writes baseline dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat 2) buffered writes with compaction antagonist to induce migration yes 1 > /proc/sys/vm/compact_memory & rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k kill % sync grep ^dirty /sys/fs/cgroup/test/memory.stat 3) buffered writes without antagonist, should match baseline rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat (speed, dirty residue) unpatched patched 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages Notice that unpatched baseline performance (1) fell after migration (3): 841 -> 114 MB/s. In the patched kernel, post migration performance matches baseline. Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting") Signed-off-by: Greg Thelen <gthelen@google.com> Reported-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 0610c25daa3e76e38ad5a8fae683a89ff9f71798) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08block: detach bdev inode from its wb in __blkdev_put()Ilya Dryomov
Since 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") inode, at some point in its lifetime, gets attached to a wb (struct bdi_writeback). Detaching happens on evict, in inode_detach_wb() called from __destroy_inode(), and involves updating wb. However, detaching an internal bdev inode from its wb in __destroy_inode() is too late. Its bdi and by extension root wb are embedded into struct request_queue, which has different lifetime rules and can be freed long before the final bdput() is called (can be from __fput() of a corresponding /dev inode, through dput() - evict() - bd_forget(). bdevs hold onto the underlying disk/queue pair only while opened; as soon as bdev is closed all bets are off. In fact, disk/queue can be gone before __blkdev_put() even returns: 1499 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part) 1500 { ... 1518 if (bdev->bd_contains == bdev) { 1519 if (disk->fops->release) 1520 disk->fops->release(disk, mode); [ Driver puts its references to disk/queue ] 1521 } 1522 if (!bdev->bd_openers) { 1523 struct module *owner = disk->fops->owner; 1524 1525 disk_put_part(bdev->bd_part); 1526 bdev->bd_part = NULL; 1527 bdev->bd_disk = NULL; 1528 if (bdev != bdev->bd_contains) 1529 victim = bdev->bd_contains; 1530 bdev->bd_contains = NULL; 1531 1532 put_disk(disk); [ We put ours, the queue is gone The last bdput() would result in a write to invalid memory ] 1533 module_put(owner); ... 1539 } Since bdev inodes are special anyway, detach them in __blkdev_put() after clearing inode's dirty bits, turning the problematic inode_detach_wb() in __destroy_inode() into a noop. add_disk() grabs its disk->queue since 523e1d399ce0 ("block: make gendisk hold a reference to its queue"), so the old ->release comment is removed in favor of the new inode_detach_wb() comment. Cc: stable@vger.kernel.org # 4.2+, needs backporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 43d1c0eb7e11919f85200d2fce211173526f7304) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08fs/block_dev.c: Remove WARN_ON() when inode writeback failsVivek Goyal
If a block device is hot removed and later last reference to device is put, we try to writeback the dirty inode. But device is gone and that writeback fails. Currently we do a WARN_ON() which does not seem to be the right thing. Convert it to a ratelimited kernel warning. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> [jmoyer@redhat.com: get rid of unnecessary name initialization, 80 cols] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit dbd3ca50753e70e09cad747dce23b1a7683a3342) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: make sure a parent css isn't offlined before its childrenTejun Heo
There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. This patch updates offline path so that a parent css is never offlined before its children. Each css keeps online_cnt which reaches zero iff itself and all its children are offline and offline_css() is invoked only after online_cnt reaches zero. This fixes the memory controller bug and allows the fix for cpu controller. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com> Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org (cherry picked from commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: make sure a parent css isn't freed before its childrenTejun Heo
There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. The previous patch updated ordering for css_offline() which fixes the cpu controller issue. While there currently isn't a known bug caused by misordering of css_free() invocations, let's fix it too for consistency. css_free() ordering can be trivially fixed by moving putting of the parent css below css_free() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> (cherry picked from commit 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08mm: page_alloc: generalize the dirty balance reserveJohannes Weiner
The dirty balance reserve that dirty throttling has to consider is merely memory not available to userspace allocations. There is nothing writeback-specific about it. Generalize the name so that it's reusable outside of that context. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a8d0143730d7b42c9fe6d1435d92ecce6863a62a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08block: fix module reference leak on put_disk() call for cgroups throttleRoman Pen
get_disk(),get_gendisk() calls have non explicit side effect: they increase the reference on the disk owner module. The following is the correct sequence how to get a disk reference and to put it: disk = get_gendisk(...); /* use disk */ owner = disk->fops->owner; put_disk(disk); module_put(owner); fs/block_dev.c is aware of this required module_put() call, but f.e. blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put a module reference. To see a leakage in action cgroups throttle config can be used. In the following script I'm removing throttle for /dev/ram0 (actually this is NOP, because throttle was never set for this device): # lsmod | grep brd brd 5175 0 # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \ /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \ done # lsmod | grep brd brd 5175 100 Now brd module has 100 references. The issue is fixed by calling module_put() just right away put_disk(). Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gi-Oh Kim <gi-oh.kim@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 39a169b62b415390398291080dafe63aec751e0a) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08memcg: fix memory.high targetVladimir Davydov
When the memory.high threshold is exceeded, try_charge() schedules a task_work to reclaim the excess. The reclaim target is set to the number of pages requested by try_charge(). This is wrong, because try_charge() usually charges more pages than requested (batch > nr_pages) in order to refill per cpu stocks. As a result, a process in a cgroup can easily exceed memory.high significantly when doing a lot of charges w/o returning to userspace (e.g. reading a file in big chunks). Fix this issue by assuring that when exceeding memory.high a process reclaims as many pages as were actually charged (i.e. batch). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9516a18a9a253a94292e74f11f92083126c5a237) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08memcg: ratify and consolidate over-charge handlingTejun Heo
try_charge() is the main charging logic of memcg. When it hits the limit but either can't fail the allocation due to __GFP_NOFAIL or the task is likely to free memory very soon, being OOM killed, has SIGKILL pending or exiting, it "bypasses" the charge to the root memcg and returns -EINTR. While this is one approach which can be taken for these situations, it has several issues. * It unnecessarily lies about the reality. The number itself doesn't go over the limit but the actual usage does. memcg is either forced to or actively chooses to go over the limit because that is the right behavior under the circumstances, which is completely fine, but, if at all avoidable, it shouldn't be misrepresenting what's happening by sneaking the charges into the root memcg. * Despite trying, we already do over-charge. kmemcg can't deal with switching over to the root memcg by the point try_charge() returns -EINTR, so it open-codes over-charing. * It complicates the callers. Each try_charge() user has to handle the weird -EINTR exception. memcg_charge_kmem() does the manual over-charging. mem_cgroup_do_precharge() performs unnecessary uncharging of root memcg, which BTW is inconsistent with what memcg_charge_kmem() does but not broken as [un]charging are noops on root memcg. mem_cgroup_try_charge() needs to switch the returned cgroup to the root one. The reality is that in memcg there are cases where we are forced and/or willing to go over the limit. Each such case needs to be scrutinized and justified but there definitely are situations where that is the right thing to do. We alredy do this but with a superficial and inconsistent disguise which leads to unnecessary complications. This patch updates try_charge() so that it over-charges and returns 0 when deemed necessary. -EINTR return is removed along with all special case handling in the callers. While at it, remove the local variable @ret, which was initialized to zero and never changed, along with done: label which just returned the always zero @ret. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 10d53c748bc9531f47e13f98e32ef28be4399862) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08memcg: punt high overage reclaim to return-to-userland pathTejun Heo
Currently, try_charge() tries to reclaim memory synchronously when the high limit is breached; however, if the allocation doesn't have __GFP_WAIT, synchronous reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". slab/slub allocator tries speculative allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the over-high reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it adds the overage to current->memcg_nr_pages_over_high and schedules execution of mem_cgroup_handle_over_high() which performs synchronous reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. It also has the following benefits. - All over-high reclaims can use GFP_KERNEL regardless of the specific gfp mask in use, e.g. GFP_NOFS, when the limit was breached. - It copes with prio inversion. Previously, a low-prio task with small memory.high might perform over-high reclaim with a bunch of locks held. If a higher prio task needed any of these locks, it would have to wait until the low prio task finished reclaim and released the locks. By handing over-high reclaim to the task exit path this issue can be avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit b23afb93d317c65cef553b804f08dec8a7a0f7e1) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08memcg: flatten task_struct->memcg_oomTejun Heo
task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 626ebc4100285be56fe3546f29b6afeb36b6871a) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: no sched migrate in include/linux/sched.h
2016-06-08mm: memcontrol: eliminate root memory.currentJohannes Weiner
memory.current on the root level doesn't add anything that wouldn't be more accurate and detailed using system statistics. It already doesn't include slabs, and it'll be a pain to keep in sync when further memory types are accounted in the memory controller. Remove it. Note that this applies to the new unified hierarchy interface only. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f5fc3c5d817435970aa301d066820a9ac12c8120) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in ↵Tejun Heo
bdi_split_work_to_wbs() bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue() to walk @bdi->wb_list. To set up the initial iteration condition, it uses list_entry_rcu() to calculate the entry pointer corresponding to the list head; however, this isn't an actual RCU dereference and using list_entry_rcu() for it ended up breaking a proposed list_entry_rcu() change because it was feeding an non-lvalue pointer into the macro. Don't use the RCU variant for simple pointer offsetting. Use list_entry() instead. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Darren Hart <dvhart@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Patrick Marlier <patrick.marlier@gmail.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pranith kumar <bobby.prani@gmail.com> Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit b33e18f61bd18227a456016a77b1a968f5bc1d65) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08writeback: bdi_writeback iteration must not skip dying onesTejun Heo
bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit b817525a4a80c04e4ca44192d97a1ffa9f2be572) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08writeback: sync_inodes_sb() must write out I_DIRTY_TIME inodes and always ↵Tejun Heo
call wait_sb_inodes() e79729123f63 ("writeback: don't issue wb_writeback_work if clean") updated writeback path to avoid kicking writeback work items if there are no inodes to be written out; unfortunately, the avoidance logic was too aggressive and broke sync_inodes_sb(). * sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME inodes dont't contribute to bdi/wb_has_dirty_io() tests and were being skipped over. * inodes are taken off wb->b_dirty/io/more_io lists after writeback starts on them. sync_inodes_sb() skipping wait_sb_inodes() when bdi_has_dirty_io() breaks it by making it return while writebacks are in-flight. This patch fixes the breakages by * Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs(). The callers are already testing the condition. * Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that it always calls into bdi_split_work_to_wbs() and wait_sb_inodes(). * Making bdi_split_work_to_wbs() consider the b_dirty_time list for WB_SYNC_ALL writebacks. Kudos to Eryu, Dave and Jan for tracking down the issue. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: e79729123f63 ("writeback: don't issue wb_writeback_work if clean") Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com Reported-and-bisected-by: Eryu Guan <eguan@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.com> Cc: Ted Ts'o <tytso@google.com> Signed-off-by: Jens Axboe <axboe@fb.com> (cherry picked from commit 006a0973ed020a81fe1f24b511ce9feb53f70e44) Signed-off-by: Alex Shi <alex.shi@linaro.org> Conflicts: get the merge conflict solution from upstream b0a1ea51bda for fs/fs-writeback.c
2016-06-08mm, vmscan: unlock page while waiting on writebackHugh Dickins
This is merely a politeness: I've not found that shrink_page_list() leads to deadlock with the page it holds locked across wait_on_page_writeback(); but nevertheless, why hold others off by keeping the page locked there? And while we're at it: remove the mistaken "not " from the commentary on this Case 3 (and a distracting blank line from Case 2, if I may). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7fadc820222497eac234d1d51a66517c00a6ca4c) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup_pids: don't account for the root cgroupTejun Heo
Because accounting resources for the root cgroup sometimes incurs measureable overhead for workloads which don't care about cgroup and often ends up calculating a number which is available elsewhere in a slightly different form, cgroup is not in the business of providing system-wide statistics. The pids controller which was introduced recently was exposing "pids.current" at the root. This patch disable accounting for root cgroup and removes the file from the root directory. While this is a userland visible behavior change, pids has been available only in one version and was badly broken there, so I don't think this will be noticeable. If it turns out to be a problem, we can reinstate it for v1 hierarchies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> (cherry picked from commit 67cde9c4938945b9510730c64e68d2f1dd7bc0aa) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: fix handling of multi-destination migration from subtree_control ↵Tejun Heo
enabling Consider the following v2 hierarchy. P0 (+memory) --- P1 (-memory) --- A \- B P0 has memory enabled in its subtree_control while P1 doesn't. If both A and B contain processes, they would belong to the memory css of P1. Now if memory is enabled on P1's subtree_control, memory csses should be created on both A and B and A's processes should be moved to the former and B's processes the latter. IOW, enabling controllers can cause atomic migrations into different csses. The core cgroup migration logic has been updated accordingly but the controller migration methods haven't and still assume that all tasks migrate to a single target css; furthermore, the methods were fed the css in which subtree_control was updated which is the parent of the target csses. pids controller depends on the migration methods to move charges and this made the controller attribute charges to the wrong csses often triggering the following warning by driving a counter negative. WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40() Modules linked in: CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29 ... ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000 ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00 ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8 Call Trace: [<ffffffff81551ffc>] dump_stack+0x4e/0x82 [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0 [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40 [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0 [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330 [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190 [<ffffffff81189016>] cgroup_attach_task+0x176/0x200 [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460 [<ffffffff81189684>] cgroup_procs_write+0x14/0x20 [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0 [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190 [<ffffffff81265f88>] __vfs_write+0x28/0xe0 [<ffffffff812666fc>] vfs_write+0xac/0x1a0 [<ffffffff81267019>] SyS_write+0x49/0xb0 [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76 This patch fixes the bug by removing @css parameter from the three migration methods, ->can_attach, ->cancel_attach() and ->attach() and updating cgroup_taskset iteration helpers also return the destination css in addition to the task being migrated. All controllers are updated accordingly. * Controllers which don't care whether there are one or multiple target csses can be converted trivially. cpu, io, freezer, perf, netclassid and netprio fall in this category. * cpuset's current implementation assumes that there's single source and destination and thus doesn't support v2 hierarchy already. The only change made by this patchset is how that single destination css is obtained. * memory migration path already doesn't do anything on v2. How the single destination css is obtained is updated and the prep stage of mem_cgroup_can_attach() is reordered to accomodate the change. * pids is the only controller which was affected by this bug. It now correctly handles multi-destination migrations and no longer causes counter underflow from incorrect accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Aleksa Sarai <cyphar@cyphar.com> (cherry picked from commit 1f7dd3e5a6e4f093017fff12232572ee1aa4639b) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in ↵Tejun Heo
freezer_attach() If one or more tasks get moved into a frozen css, the frozen state is cleared up from the destination css so that it can be reasserted once the migrated tasks are frozen. freezer_attach() implements this in two separate steps - clearing CGROUP_FROZEN on the target css while processing each task and propagating the clearing upwards after the task loop is done if necessary. This patch merges the two steps. Propagation now takes place inside the task loop. This simplifies the code and prepares it for the fix of multi-destination migration. Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 599c963a0f19b14132065788322207eaa58bc7f8) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()Oleg Nesterov
Now that we know that the forking task can't migrate amd the child is always moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can change pids_can_fork() and pids_cancel_fork() to just use task_css(current). And since we no longer need to pin this css, we can remove pid_fork(). Note: the patch uses task_css_check(true), perhaps it makes sense to add a helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into account. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit afbcb364bee9e7cf46c94257a82cb9760b6d254f) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()Oleg Nesterov
If the new child migrates to another cgroup before cgroup_post_fork() calls subsys->fork(), then both pids_can_attach() and pids_fork() will do the same pids_uncharge(old_pids) + pids_charge(pids) sequence twice. Change copy_process() to call threadgroup_change_begin/threadgroup_change_end unconditionally. percpu_down_read() is cheap and this allows other cleanups, see the next changes. Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit c9e75f0492b248aeaa7af8991a6fc9a21506bc96) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: make css_set pin its css's to avoid use-afer-freeTejun Heo
A css_set represents the relationship between a set of tasks and css's. css_set never pinned the associated css's. This was okay because tasks used to always disassociate immediately (in RCU sense) - either a task is moved to a different css_set or exits and never accesses css_set again. Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") and patches leading up to it made a zombie hold onto its css_set and deref the associated css's on its release. Nothing pins the css's after exit and it might have already been freed leading to use-after-free. general protection fault: 0000 [#1] PREEMPT SMP task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000 RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 ... Call Trace: <IRQ> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0 [<ffffffff810f8893>] cgroup_free+0x53/0xe0 [<ffffffff8104ed62>] __put_task_struct+0x42/0x130 [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130 [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820 [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820 [<ffffffff81056e54>] __do_softirq+0xd4/0x460 [<ffffffff81057369>] irq_exit+0x89/0xa0 [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50 [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90 <EOI> ... Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02 RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40 RSP <ffff88001fc03e20> ---[ end trace 89a4a4b916b90c49 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by making css_set pin the associate css's until its release. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller") (cherry picked from commit 53254f900bd9ff1e3cc5628e76126bb403d9d160) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: fix cftype->file_offset handlingTejun Heo
6f60eade2433 ("cgroup: generalize obtaining the handles of and notifying cgroup files") introduced cftype->file_offset so that the handles for per-css file instances can be recorded. These handles then can be used, for example, to generate file modified notifications. Unfortunately, it made the wrong assumption that files are created once for a given css and removed on its destruction. Due to the dependencies among subsystems, a css may be hidden from userland and then later shown again. This is implemented by removing and re-creating the affected files, so the associated kernfs_node for a given cgroup file may change over time. This incorrect assumption led to the corruption of css->files lists. Reimplement cftype->file_offset handling so that cgroup_file->kn is protected by a lock and updated as files are created and destroyed. This also makes keeping them on per-cgroup list unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: James Sedgwick <jsedgwick@fb.com> Fixes: 6f60eade2433 ("cgroup: generalize obtaining the handles of and notifying cgroup files") Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zefan Li <lizefan@huawei.com> (cherry picked from commit 34c06254ff82a815fdccdfae7517a06c9b768cee) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: pids: fix invalid get/put usageAleksa Sarai
Fix incorrect usage of css_get and css_put to put a different css in pids_{cancel_,}attach() than the one grabbed in pids_can_attach(). This could lead to quite serious memory leakage (and unsafe operations on the putted css). tj: minor comment update Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit ce52399520e4b97466165737e00c7b528ae8c8f5) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: fix race condition around termination check in css_task_iter_next()Tejun Heo
css_task_iter_next() checked @it->cur_task before grabbing css_set_lock and assumed that the result won't change afterwards; however, tasks could leave the cgroup being iterated terminating the iterator before css_task_lock is acquired. If this happens, css_task_iter_next() tries to calculate the current task from NULL cg_list pointer leading to the following oops. BUG: unable to handle kernel paging request at fffffffffffff7d0 IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80 ... CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1 Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014 task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000 RIP: 0010:[<ffffffff810d5f22>] [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80 RSP: 0018:ffff88083404fd28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0 RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278 RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0 R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400 R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58 FS: 00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0 Stack: 0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248 ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee 0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000 Call Trace: [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550 [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20 [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0 [<ffffffff811cab76>] seq_read+0x166/0x380 [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180 [<ffffffff811a7398>] __vfs_read+0x18/0x50 [<ffffffff811a745d>] vfs_read+0x8d/0x150 [<ffffffff811a756f>] SyS_read+0x4f/0xb0 [<ffffffff818d4772>] system_call_fastpath+0x12/0x17 Fix it by moving the termination condition check inside css_set_lock. @it->cur_task is now cleared after being put and @it->task_pos is tested for termination instead of @it->cset_pos as they indicate the same condition and @it->task_pos is what's being dereferenced. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Calvin Owens <calvinowens@fb.com> Fixes: ed27b9f7a17d ("cgroup: don't hold css_set_rwsem across css task iteration") Acked-by: Zefan Li <lizefan@huawei.com> (cherry picked from commit d57456753787ab158f906f1f8eb58d54a2ccd9f4) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08blkcg: don't create "io.stat" on the root cgroupTejun Heo
The stat files on the root cgroup shows stats for the whole system and usually don't contain any information which isn't available through the usual system monitoring mechanisms. Some controllers skip collecting these duplicate stats to optimize cases where cgroup isn't used and later try to emulate the result on demand. This leads to complexities and subtle differences in the information shown through different channels. This is entirely unnecessary and cgroup v2 is dropping stat files which are duplicate from all controllers. This patch removes "io.stat" from the root hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com> (cherry picked from commit ca0752c5e3e6fad83d286a22d729390bd8004aec) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: drop cgroup__DEVEL__legacy_files_on_dflTejun Heo
Now that interfaces for the major three controllers - cpu, memory, io - are shaping up, there's no reason to have an option to force legacy files to show up on the unified hierarchy for testing. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> (cherry picked from commit e4b7037c8613da41fb3f7b029414fe25370f53c0) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: replace error handling in cgroup_init() with WARN_ON()sTejun Heo
The init sequence shouldn't fail short of bugs and even when it does it's better to continue with the rest of initialization and we were silently ignoring /proc/cgroups creation failure. Drop the explicit error handling and wrap sysfs_create_mount_point(), register_filesystem() and proc_create() with WARN_ON()s. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> (cherry picked from commit 035f4f510583193949168c77dc957293df24bd77) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: add cgroup_subsys->free() method and use it to fix pids controllerTejun Heo
pids controller is completely broken in that it uncharges when a task exits allowing zombies to escape resource control. With the recent updates, cgroup core now maintains cgroup association till task free and pids controller can be fixed by uncharging on free instead of exit. This patch adds cgroup_subsys->free() method and update pids controller to use it instead of ->exit() for uncharging. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> (cherry picked from commit afcf6c8b75444382e0f9996157207ebae34a8848) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: keep zombies associated with their original cgroupsTejun Heo
cgroup_exit() is called when a task exits and disassociates the exiting task from its cgroups and half-attach it to the root cgroup. This is unnecessary and undesirable. No controller actually needs an exiting task to be disassociated with non-root cgroups. Both cpu and perf_event controllers update the association to the root cgroup from their exit callbacks just to keep consistent with the cgroup core behavior. Also, this disassociation makes it difficult to track resources held by zombies or determine where the zombies came from. Currently, pids controller is completely broken as it uncharges on exit and zombies always escape the resource restriction. With cgroup association being reset on exit, fixing it is pretty painful. There's no reason to reset cgroup membership on exit. The zombie can be removed from its css_set so that it doesn't show up on "cgroup.procs" and thus can't be migrated or interfere with cgroup removal. It can still pin and point to the css_set so that its cgroup membership is maintained. This patch makes cgroup core keep zombies associated with their cgroups at the time of exit. * Previous patches decoupled populated_cnt tracking from css_set lifetime, so a dying task can be simply unlinked from its css_set while pinning and pointing to the css_set. This keeps css_set association from task side alive while hiding it from "cgroup.procs" and populated_cnt tracking. The css_set reference is dropped when the task_struct is freed. * ->exit() callback no longer needs the css arguments as the associated css never changes once PF_EXITING is set. Removed. * cpu and perf_events controllers no longer need ->exit() callbacks. There's no reason to explicitly switch away on exit. The final schedule out is enough. The callbacks are removed. * On traditional hierarchies, nothing changes. "/proc/PID/cgroup" still reports "/" for all zombies. On the default hierarchy, "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged to at the time of exit. If the cgroup gets removed before the task is reaped, " (deleted)" is appended. v2: Build brekage due to missing dummy cgroup_free() when !CONFIG_CGROUP fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> (cherry picked from commit 2e91fa7f6d451e3ea9fec999065d2fd199691f9d) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: make css_set_rwsem a spinlock and rename it to css_set_lockTejun Heo
css_set_rwsem is the inner lock protecting css_sets and is accessed from hot paths such as fork and exit. Internally, it has no reason to be a rwsem or even mutex. There are no internal blocking operations while holding it. This was rwsem because css task iteration used to expose it to external iterator users. As the previous patch updated css task iteration such that the locking is not leaked to its users, there's no reason to keep it a rwsem. This patch converts css_set_rwsem to a spinlock and rename it to css_set_lock. It uses bh-safe operations as a planned usage needs to access it from RCU callback context. Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit f0d9a5f175753a371bc7fdff0d584a8d9cd72bb0) Signed-off-by: Alex Shi <alex.shi@linaro.org>
2016-06-08cgroup: don't hold css_set_rwsem across css task iterationTejun Heo
css_sets are synchronized through css_set_rwsem but the locking scheme is kinda bizarre. The hot paths - fork and exit - have to write lock the rwsem making the rw part pointless; furthermore, many readers already hold cgroup_mutex. One of the readers is css task iteration. It read locks the rwsem over the entire duration of iteration. This leads to silly locking behavior. When cpuset tries to migrate processes of a cgroup to a different NUMA node, css_set_rwsem is held across the entire migration attempt which can take a long time locking out forking, exiting and other cgroup operations. This patch updates css task iteration so that it locks css_set_rwsem only while the iterator is being advanced. css task iteration involves two levels - css_set and task iteration. As css_sets in use are practically immutable, simply pinning the current one is enough for resuming iteration afterwards. Task iteration is tricky as tasks may leave their css_set while iteration is in progress. This is solved by keeping track of active iterators and advancing them if their next task leaves its css_set. v2: put_task_struct() in css_task_iter_next() moved outside css_set_rwsem. A later patch will add cgroup operations to task_struct free path which may grab the same lock and this avoids deadlock possibilities. css_set_move_task() updated to use list_for_each_entry_safe() when walking task_iters and advancing them. This is necessary as advancing an iter may remove it from the list. Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit ed27b9f7a17ddfbc007e16d4d11f33dff4fc2de7) Signed-off-by: Alex Shi <alex.shi@linaro.org>