2015-08-19  nohz: Set isolcpus when nohz_full is set  (Chris Metcalf) [tag: linux-lng-preempt-rt-3.18.16-2015.08]
nohz_full is only useful when isolcpus is also set, since otherwise the scheduler has to run periodically to try to determine whether to steal work from other cores.

Accordingly, when booting with nohz_full=xxx on the command line, we should act as if isolcpus=xxx was also set, and set (or extend) the isolcpus set to include the nohz_full cpus.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-09  Merge tag 'lsk-v3.18-15.07-rt' of http://git.linaro.org/kernel/linux-linaro-stable into linux-linaro-lng-v3.18-rt  (Gary S. Robertson)
LSK RT 15.07 v3.18
2015-07-22  hrtimer.c: remove extraneous braces  (Gary S. Robertson) [tag: linux-lng-preempt-rt-3.18.13-2015.07]
Signed-off-by: Gary S. Robertson <gary.robertson@linaro.org>
2015-07-22  Revert "tick: SHUTDOWN event-dev if no events are required for KTIME_MAX"  (Gary S. Robertson)
This reverts commit c817b87cb66410545e0b45f05a015d3b6bc2cec3. Per request from the patch's author.
2015-07-22  clockevents: Stop unused clockevent device  (Viresh Kumar)
Clockevent devices can now be stopped by switching to ONESHOT_STOPPED mode, to avoid getting spurious interrupts on a tickless CPU. This patch switches the mode to ONESHOT_STOPPED at three different places; the reasoning behind them follows.

1.) NOHZ_MODE_LOWRES

Timers & hrtimers depend on the tick for their working in this mode, and the only place from where the clockevent device is programmed is the tick code. So we only need to switch the clockevent device to ONESHOT_STOPPED mode once ticks aren't required anymore, and the only call site is tick_nohz_stop_sched_tick(). In LOWRES mode we skip reprogramming the clockevent device here if expires == KTIME_MAX. In addition to that, we must also switch the clockevent device to ONESHOT_STOPPED mode to avoid all spurious interrupts that may follow.

2.) NOHZ_MODE_HIGHRES

The tick & timers depend on hrtimers for their working in this mode, and the only place from where the clockevent device is programmed is the hrtimer code. There are two places here from which we reprogram the clockevent device or skip reprogramming it when expires == KTIME_MAX. Instead of just skipping the reprogramming, also switch the device's mode to ONESHOT_STOPPED so that it doesn't generate any spurious interrupts.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18 - manually applied a few patches]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  clockevents: Restart clockevent device before using it  (Viresh Kumar)
The clockevent device might have been switched to ONESHOT_STOPPED mode to avoid getting spurious interrupts on a tickless CPU. Before reprogramming the next event, we must reconfigure the clockevent device back to ONESHOT mode if required. This patch switches the mode to ONESHOT at three different places; the reasoning behind them follows.

1.) NOHZ_MODE_LOWRES

Timers & hrtimers depend on the tick for their working in this mode, and the only place from where the clockevent device is programmed is the tick code. So we need to switch the clockevent device to ONESHOT mode before we start using it. Two routines can restart ticks here in LOWRES mode: tick_nohz_stop_sched_tick() and tick_nohz_restart().

2.) NOHZ_MODE_HIGHRES

The tick & timers depend on hrtimers for their working in this mode, and the only place from where the clockevent device is programmed is the hrtimer code. Only hrtimer_reprogram() is responsible for programming the clockevent device for the next event if the device was stopped earlier, so updating that alone is sufficient here.

To make sure we haven't missed any corner case, add a WARN() for the case where we try to reprogram a clockevent device that is still in ONESHOT_STOPPED mode.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2015-07-22  clockevents: Introduce CLOCK_EVT_MODE_ONESHOT_STOPPED mode  (Viresh Kumar)
When no timers/hrtimers are pending, the expiry time is set to a special value: KTIME_MAX. This normally happens with NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes. When expiry == KTIME_MAX, we either cancel the 'tick-sched' hrtimer (NOHZ_MODE_HIGHRES) or skip reprogramming the clockevent device (NOHZ_MODE_LOWRES). But the clockevent device was already reprogrammed from the tick handler for the next tick.

As the clockevent device is programmed in ONESHOT mode, it will fire at least one more time, unnecessarily. Timers on many implementations (like arm_arch_timer, powerpc, etc.) only support PERIODIC mode, and their drivers emulate ONESHOT on top of that. This means that on these platforms we will get spurious interrupts at the last programmed interval rate, normally the tick rate.

In order to avoid spurious interrupts/wakeups, the clockevent device should be stopped or its interrupts should be masked. A simple (yet hacky) solution could be: update hrtimer_force_reprogram() to always reprogram the clockevent device, and update clockevent drivers to STOP generating events (or delay them to the maximum time) when 'expires' is set to KTIME_MAX. But the drawback here is that every clockevent driver would have to be hacked for this particular case, and it's very easy for new ones to miss this.

However, Thomas suggested adding an optional mode, ONESHOT_STOPPED, to solve this problem: lkml.org/lkml/2014/5/9/508. This patch adds support for the ONESHOT_STOPPED mode in the clockevents core. It will only be available to drivers that implement the mode-specific set-mode callbacks instead of the legacy ->set_mode() callback.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
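As a rough illustration, a driver-side handler for the new mode on hardware that only emulates ONESHOT over PERIODIC might simply mask the interrupt so nothing fires until the device is restarted. This is a hedged sketch; timer_base and the TIMER_CTRL/TIMER_CTRL_IMASK register names are illustrative, not taken from any driver in this tree:

	static void dummy_set_mode_oneshot_stopped(struct clock_event_device *evt)
	{
		u32 ctrl = readl(timer_base + TIMER_CTRL);

		/* mask the timer interrupt: no events until restarted */
		writel(ctrl | TIMER_CTRL_IMASK, timer_base + TIMER_CTRL);
	}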
2015-07-22  clockevents: Introduce mode specific callbacks  (Viresh Kumar)
It is not possible for the clockevents core to know which modes (other than those with a corresponding feature flag) are supported by a particular implementation, and drivers are expected to handle transitions to all modes elegantly, as ->set_mode() would be issued for them unconditionally.

Now, adding support for a new mode complicates things a bit if we want to use the legacy ->set_mode() callback: we would need to closely review all clockevents drivers to see if they would break on the addition of a new mode. And after such reviews, it was found that we would have to make non-trivial changes to most of the drivers [1].

Introduce mode-specific set_mode_*() callbacks, some of which the drivers may or may not implement. A missing callback clearly conveys the message that the corresponding mode isn't supported. A driver may still choose to keep supporting the legacy ->set_mode() callback, but ->set_mode() won't support any new modes beyond RESUME. If a driver wants to benefit from a new mode, it will be required to migrate to the mode-specific callbacks.

The legacy ->set_mode() callback and the newly introduced mode-specific callbacks are mutually exclusive; only one of them should be supported by the driver. A sanity check is done at registration time to distinguish between optional and required callbacks and to make error recovery and handling simpler. If the legacy ->set_mode() callback is provided, all mode-specific ones are ignored by the core.

Call sites calling ->set_mode() directly are also updated to use __clockevents_set_mode() instead, as ->set_mode() may not be available anymore for some drivers.

[1] https://lkml.org/lkml/2014/12/9/605
[2] https://lkml.org/lkml/2015/1/23/255

Suggested-by: Thomas Gleixner <tglx@linutronix.de> [2]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
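A hedged sketch of a driver registering the mode-specific hooks instead of the legacy one, as described above. The callback member names follow the patch's set_mode_*() naming; the device and handler names are illustrative:

	static struct clock_event_device dummy_clockevent = {
		.name			= "dummy",
		.features		= CLOCK_EVT_FEAT_ONESHOT,
		.set_mode_shutdown	= dummy_shutdown,
		.set_mode_oneshot	= dummy_set_oneshot,
		.set_next_event		= dummy_next_event,
		/* no legacy .set_mode: the two schemes are mutually exclusive */
	};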
2015-07-22  no-hz_full: build fix  (Santosh Shukla)
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  hrtimer: make sure PINNED flag is cleared after removing hrtimer  (Viresh Kumar)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  sched/nohz: add debugfs control over sched_tick_max_deferment  (Kevin Hilman)
Allow debugfs override of sched_tick_max_deferment in order to ease finding/fixing the remaining issues with full nohz. The value to be written is in jiffies, and -1 means the max deferment is disabled (scheduler_tick_max_deferment() returns KTIME_MAX.)

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Kevin Hilman <khilman@linaro.org>
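A minimal sketch of the kernel side of such a knob, assuming the patch keeps the value in a signed variable named sched_tick_max_deferment (the variable name, file location, and use of debugfs_create_u32() are illustrative, not necessarily what the patch does; -1 would read back as 0xffffffff through the u32 view):

	#include <linux/debugfs.h>
	#include <linux/init.h>

	extern int sched_tick_max_deferment;	/* in jiffies; -1 disables */

	static int __init tick_max_deferment_debugfs(void)
	{
		/* expose it read/write under /sys/kernel/debug */
		debugfs_create_u32("sched_tick_max_deferment", 0644, NULL,
				   (u32 *)&sched_tick_max_deferment);
		return 0;
	}
	late_initcall(tick_max_deferment_debugfs);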
2015-07-22  tick: SHUTDOWN event-dev if no events are required for KTIME_MAX  (Viresh Kumar)
When expires is set to KTIME_MAX in tick_program_event(), we are sure that there are no events enqueued for a very long time, and so there is no point keeping the event device running. We will get interrupted without any work to do many times, for example when the timer's counter overflows. So it's better to SHUTDOWN the event device then, and restart it once we get a request for the next event.

For implementing this, a new field 'last_mode' is added to 'struct clock_event_device' to keep track of the last mode used.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2015-07-22  hrtimer: reprogram event for expires=KTIME_MAX in hrtimer_force_reprogram()  (Viresh Kumar)
In hrtimer_force_reprogram(), we reprogram the event device only if the next timer event is before KTIME_MAX. But what if it is equal to KTIME_MAX? As we aren't reprogramming the device, it will stay set to the last value it was programmed with, probably the tick interval, i.e. a few milliseconds. So we will get an interrupt due to that, won't have any hrtimers to service, and will return without doing much.

The implementation of the event device's driver may make things even worse. For example, drivers/clocksource/arm_arch_timer.c disables the event device only on SHUTDOWN/UNUSED requests in set-mode; otherwise it will keep raising interrupts at the tick interval even if hrtimer_interrupt() didn't reprogram the tick.

To get this fixed, let's reprogram the event device even for KTIME_MAX, so that the timer is scheduled far enough into the future.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18 kernel]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  sched: don't queue timers on quiesced CPUs  (Viresh Kumar)
CPUSets now have a cpusets.quiesce sysfs file, with which some CPUs can opt to isolate themselves from background kernel activities like timers & hrtimers. get_nohz_timer_target() is used for finding a suitable CPU on which to fire a timer. To guarantee that new timers won't be queued on quiesced CPUs, we need to modify this routine.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
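A hedged sketch of the shape such a check could take inside get_nohz_timer_target(); the cpu_quiesced_mask name is illustrative, since the patch text doesn't say how quiesced CPUs are recorded:

	static int get_nohz_timer_target_sketch(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			/* never hand unpinned timers to a quiesced CPU */
			if (cpumask_test_cpu(cpu, cpu_quiesced_mask))
				continue;
			if (!idle_cpu(cpu))
				return cpu;	/* prefer a busy, non-quiesced CPU */
		}
		return smp_processor_id();
	}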
2015-07-22  cpuset: Create sysfs file: cpusets.quiesce to isolate CPUs  (Viresh Kumar)
For networking applications, platforms need to provide one CPU for each user space data plane thread. These CPUs shouldn't be interrupted by the kernel at all unless user space has requested some functionality. Currently, there are background kernel activities running on almost every CPU, like timers/hrtimers/watchdogs/etc., and these need to be migrated to other CPUs.

To achieve that, this patch adds another option to cpusets: 'quiesce'. Writing '1' to this file migrates unbound/unpinned timers/hrtimers away from the CPUs of the cpuset in question, and disallows the addition of any new unpinned timers/hrtimers to isolated CPUs (handled in the next patch). Writing '0' disables isolation of the CPUs in the current cpuset, and unpinned timers/hrtimers are allowed on these CPUs again.

Currently, only timers and hrtimers are migrated. Other kernel infrastructure can follow later if required.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
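A hedged user-space usage sketch, assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset and the file is named cpusets.quiesce as the patch describes (both the mount point and the file name are assumptions):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* quiesce (on=1) or unquiesce (on=0) the CPUs of the cpuset at 'path',
	 * e.g. "/sys/fs/cgroup/cpuset/dataplane" */
	static int cpuset_quiesce(const char *path, int on)
	{
		char file[256];
		int fd, ret;

		snprintf(file, sizeof(file), "%s/cpusets.quiesce", path);
		fd = open(file, O_WRONLY);
		if (fd < 0)
			return -1;
		ret = (write(fd, on ? "1" : "0", 1) == 1) ? 0 : -1;
		close(fd);
		return ret;
	}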
2015-07-22  hrtimer: create hrtimer_quiesce_cpu() to isolate CPU from hrtimers  (Viresh Kumar)
To isolate CPUs from hrtimers via cpusets' sysfs interface, we need some support from the hrtimer core, i.e. a routine hrtimer_quiesce_cpu() which migrates away all unpinned hrtimers but doesn't touch the pinned ones. This patch creates that routine.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  hrtimer: update timer->state with 'pinned' information  (Viresh Kumar)
'Pinned' information is now required in migrate_hrtimers(), as we can migrate non-pinned timers away without a hotplug (i.e. with cpuset.quiesce). So we need a way to identify pinned timers, as we can't migrate them.

This patch reuses the timer->state variable for storing this flag, as there are enough free bits available in that variable and there is no point increasing the size of this struct by adding another field.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  timer: create timer_quiesce_cpu() to isolate CPU from timers  (Viresh Kumar)
To isolate CPUs from timers via cpusets' sysfs interface, we need some support from the timer core, i.e. a routine timer_quiesce_cpu() which migrates away all unpinned timers but doesn't touch the pinned ones. This patch creates that routine.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-07-22  timer: track pinned timers with TIMER_PINNED flag  (Viresh Kumar)
In order to quiesce a CPU on which isolation might be required, we need to move away all the timers queued on that CPU. There are two types of timers queued on any CPU: ones that are pinned to that CPU, and others that can run on any CPU but happen to be queued on the CPU in question. We need to migrate only the second type away from a CPU entering the quiesce state.

For this we need some basic infrastructure in the timer core to identify which timers are pinned and which are not. Hence, this patch adds another flag bit, TIMER_PINNED, which is set only for timers that are pinned to a CPU. It also removes the 'pinned' parameter of __mod_timer(), as it is no longer required.

NOTE: one functional change worth mentioning.

Existing behavior: add_timer_on() followed by multiple mod_timer() calls wouldn't pin the timer on the CPU mentioned in add_timer_on().
New behavior: add_timer_on() followed by multiple mod_timer() calls pins the timer on the CPU running mod_timer().

I didn't give much attention to this, as we should call mod_timer_on() for timers queued with add_timer_on(). Though, if required, we can simply clear the TIMER_PINNED flag in mod_timer().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[forward port to 3.18]
Signed-off-by: Santosh Shukla <santosh.shukla@linaro.org>
2015-06-29  Merge branch 'linux-linaro-lsk-v3.18' into linux-linaro-lsk-v3.18-rt  (Kevin Hilman) [tags: lsk-v3.18-16.01-rt, lsk-v3.18-15.12-rt, lsk-v3.18-15.11-rt, lsk-v3.18-15.10-rt, lsk-v3.18-15.09-rt, lsk-v3.18-15.08-rt, lsk-v3.18-15.07-rt]
2015-06-29  Merge tag 'v3.18.16' of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable into linux-linaro-lsk-v3.18-rt  (Kevin Hilman)
Linux 3.18.16

* tag 'v3.18.16' of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable: (394 commits)
  Linux 3.18.16
  arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
  md/raid0: fix restore to sector variable in raid0_make_request
  Linux 3.18.15
  ARM: OMAP3: Fix booting with thumb2 kernel
  xfrm: release dst_orig in case of error in xfrm_lookup()
  ARC: unbork !LLSC build
  power/reset: at91: fix return value check in at91_reset_platform_probe()
  vfs: read file_handle only once in handle_to_path
  drm/radeon: partially revert "fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling"
  drm/radeon: don't share plls if monitors differ in audio support
  drm/radeon: retry dcpd fetch
  drm/radeon: fix VM_CONTEXT*_PAGE_TABLE_END_ADDR handling
  drm/radeon: add new bonaire pci id
  iwlwifi: pcie: prevent using unmapped memory in fw monitor
  ACPI / init: Fix the ordering of acpi_reserve_resources()
  sd: Disable support for 256 byte/sector disks
  storvsc: Set the SRB flags correctly when no data transfer is needed
  rtlwifi: rtl8192cu: Fix kernel deadlock
  md/raid5: don't record new size if resize_stripes fails.
  ...
2015-06-16  Make ISR threading a compile-time-only option  (Gary S. Robertson)
Signed-off-by: Gary S. Robertson <gary.robertson@linaro.org>

Conflicts:
	kernel/irq/manage.c

Conflicts:
	include/linux/interrupt.h
	kernel/irq/manage.c
2015-06-15  Merge branch 'v3.18/topic/thermal' into linux-linaro-lsk-v3.18  (Kevin Hilman)

* v3.18/topic/thermal: (66 commits)
  thermal: exynos: fix compile error in _zone_bind_cooling_device()
  thermal: of-thermal: add support for reading coefficients property
  thermal: support slope and offset coefficients
  thermal: power_allocator: round the division when divvying up power
  kernel.h: implement DIV_ROUND_CLOSEST_ULL
  thermal: cpu_cooling: Fix power calculation when CPUs are offline
  thermal: cpu_cooling: Remove cpu_dev update on policy CPU update
  thermal: Default OF created trip points to writable
  thermal: export thermal_zone_parameters to sysfs
  thermal: core: Add Kconfig option to enable writable trips
  of: thermal: Introduce sustainable power for a thermal zone
  thermal: add trace events to the power allocator governor
  thermal: introduce the Power Allocator governor
  thermal: cpu_cooling: implement the power cooling device API
  thermal: extend the cooling device API to include power information
  thermal: let governors have private data for each thermal zone
  thermal: fair_share: generalize the weight concept
  thermal: export weight to sysfs
  thermal: fair_share: use the weight from the thermal instance
  thermal: of: fix cooling device weights in device tree
  ...
2015-06-10  sched: Handle priority boosted tasks proper in setscheduler()  (Thomas Gleixner)
[ Upstream commit 0782e63bc6fe7e2d3408d250df11d388b7799c6b ]

Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong, as T1 should now be back at prio 20.

Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()") only handles the case where a boosted task tries to lower its priority. Fix it by taking the new effective priority into account for the decision whether a change of the priority is required.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-09  module: Call module notifier on failure after complete_formation()  (Steven Rostedt)
[ Upstream commit 37815bf866ab6722a47550f8d25ad3f1a16a680c ]

The module notifier call chain for MODULE_STATE_COMING was moved up before the parsing of args, into the complete_formation() call. But if the module failed to load after that, the notifier call chain for MODULE_STATE_GOING was never called, and that prevented the users of those call chains from cleaning up anything that was allocated.

Link: http://lkml.kernel.org/r/554C52B9.9060700@gmail.com
Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Fixes: 4982223e51e8 "module: set nx before marking module MODULE_STATE_COMING"
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-09  ktime: Fix ktime_divns to do signed division  (John Stultz)
[ Upstream commit f7bcb70ebae0dcdb5a2d859b09e4465784d99029 ]

It was noted that the 32-bit implementation of ktime_divns() was doing unsigned division and didn't properly handle negative values. And when a ktime helper was changed to utilize ktime_divns(), it caused a regression on some IR blasters. See the following bugzilla for details: https://bugzilla.redhat.com/show_bug.cgi?id=1200353

This patch fixes the problem in ktime_divns() by checking and preserving the sign bit, and then reapplying it, if appropriate, after the division. It also changes the return type to s64 to make it more obvious that this is expected.

Nicolas also pointed out that negative divisors would cause infinite loops on 32-bit systems. Negative divisors are unlikely for users of this function, but out of caution this patch adds checks for them in both the 32-bit (BUG_ON) and 64-bit (WARN_ON) versions, to make sure no such use cases creep in.

[ tglx: Hand an u64 to do_div() to avoid the compiler warning ]

Fixes: 166afb64511e 'ktime: Sanitize ktime_to_us/ms conversion'
Reported-and-tested-by: Trevor Cordes <trevor@tecnopolis.ca>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1431118043-23452-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
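The core idea, as a hedged stand-alone C sketch (not the kernel's exact code, which feeds a u64 to do_div()): divide the magnitude, then reapply the sign that unsigned division would otherwise lose.

	#include <stdint.h>

	static int64_t divns_sketch(int64_t ns, int64_t div)
	{
		/* work on the magnitude so unsigned division is safe */
		uint64_t tmp = ns < 0 ? -(uint64_t)ns : (uint64_t)ns;
		uint64_t res = tmp / (uint64_t)div;	/* div assumed positive */

		/* reapply the sign after the division */
		return ns < 0 ? -(int64_t)res : (int64_t)res;
	}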
2015-06-09  ktime: Optimize ktime_divns for constant divisors  (Nicolas Pitre)
[ Upstream commit 8b618628b2bf83512fc8df5e8672619d65adfdfb ]

At least on ARM, do_div() is optimized to turn constant divisors into an inline multiplication by the reciprocal value at compile time. However this optimization is missed entirely whenever ktime_divns() is used, and the slow out-of-line division code is used all the time instead.

Let ktime_divns() use do_div() inline whenever the divisor is constant and small enough. This will make things like ktime_to_us() and ktime_to_ms() much faster.

Cc: Arnd Bergmann <arnd.bergmann@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicolas Pitre <nico@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
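A hedged sketch of the dispatch this describes, modelled on the upstream inline (details may differ in this tree): when the divisor is a compile-time constant that fits in 32 bits, let do_div() specialize it; otherwise fall back to the out-of-line helper.

	static inline s64 ktime_divns_sketch(const ktime_t kt, s64 div)
	{
		if (__builtin_constant_p(div) && !(div >> 32)) {
			/* constant path: do_div() becomes a
			 * multiply-by-reciprocal at compile time */
			u64 ns = kt.tv64;

			do_div(ns, div);
			return ns;
		}
		return __ktime_divns(kt, div);	/* slow out-of-line division */
	}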
2015-06-05  tracing: Add array printing helper  (Dave Martin)
If a trace event contains an array, there is currently no standard way to format this for text output. Drivers are currently hacking around this by a) local hacks that use the trace_seq functionality directly, or b) just not printing that information. For fixed-size arrays, formatting of the elements can be open-coded, but this gets cumbersome for arrays of non-trivial size.

These approaches result in non-standard content of the event format description delivered to userspace, so userland tools need to be taught to understand and parse each array printing method individually.

This patch implements a __print_array() helper that tracepoint implementations can use instead of reinventing it. A simple C-style syntax is used to delimit the array and its elements {like,this}.

So that the helper can be used with large static arrays as well as dynamic arrays, it takes a pointer and an element count: it can be used with __get_dynamic_array() for dynamic arrays.

Link: http://lkml.kernel.org/r/1422449335-8289-2-git-send-email-javi.merino@arm.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
(cherry picked from commit 6ea22486ba46bcb665de36514094d74575cd1330)
Signed-off-by: Kevin Hilman <khilman@linaro.org>
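A hedged usage sketch inside a TRACE_EVENT()'s TP_printk(); the event field names (cpu, and load as a u32 dynamic array) are illustrative, and __get_dynamic_array_len() is assumed to be available as in the upstream series:

	TP_printk("cpu=%d load={%s}", __entry->cpu,
		  __print_array(__get_dynamic_array(load),
				__get_dynamic_array_len(load) / sizeof(u32),
				sizeof(u32)))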
2015-05-20  workqueue: Prevent deadlock/stall on RT  (Thomas Gleixner)
Austin reported an XFS deadlock/stall on RT where scheduled work never gets executed and tasks are waiting for each other forever.

The underlying problem is the modification of the RT code to the handling of workers which are about to go to sleep. In mainline, a worker thread which goes to sleep wakes an idle worker if there is more work to do. This happens from the guts of the schedule() function. On RT this must happen outside, and the accessed data structures are not protected against scheduling due to the spinlock-to-rtmutex conversion. So the naive solution was to move the code outside of the scheduler and protect the data structures by the pool lock.

That approach turned out to be a little naive, as we cannot call into that code when the thread blocks on a lock: it is not allowed to block on two locks in parallel. So we don't call into the worker wakeup magic when the worker is blocked on a lock, which causes the deadlock/stall observed by Austin and Mike.

Looking deeper into that worker code, it turns out that the only relevant data structure which needs to be protected is the list of idle workers which can be woken up. So the solution is to protect the list manipulation operations with preempt_enable/disable pairs on RT and call unconditionally into the worker code even when the worker is blocked on a lock. The preemption protection is safe, as there is nothing which can fiddle with the list outside of thread context.

Reported-and-tested-by: Austin Schuh <austin@peloton-tech.com>
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable-rt@vger.kernel.org
2015-05-20  sched: Do not clear PF_NO_SETAFFINITY flag in select_fallback_rq()  (Steven Rostedt)
I talked with Peter Zijlstra about this, and he told me that the clearing of the PF_NO_SETAFFINITY flag was to deal with the optimization of migrate_disable/enable() that ignores tasks that have that flag set. But that optimization was removed when I did a rework of the CPU hotplug code.

I found that ignoring tasks that had that flag set would cause those tasks to not sync with the hotplug code and cause the kernel to crash. Thus they needed to not be treated specially, and those tasks had to go through the same work as tasks without that flag set. Now that those tasks are not treated specially, there's no reason to clear the flag.

May still need to be tested, as the migrate_me() code does not ignore those flags.

Cc: stable-rt@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140701111444.0cfebaa1@gandalf.local.home
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  rt,ntp: Move call to schedule_delayed_work() to helper thread  (Steven Rostedt)
The ntp code for notify_cmos_timer() is called from hard interrupt context. schedule_delayed_work() under PREEMPT_RT_FULL takes spinlocks that have been converted to mutexes, thus calling schedule_delayed_work() from interrupt context is not safe.

Add a helper thread that does the call to schedule_delayed_work(), and wake up that thread instead of calling schedule_delayed_work() directly. This is only for CONFIG_PREEMPT_RT_FULL; otherwise the code still calls schedule_delayed_work() directly in irq context.

Note: there are a few places in the kernel that do this. Perhaps the RT code should have a dedicated thread that does the checks: just register a notifier on boot up for your check and wake up the thread when needed. This will be a todo.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
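A hedged sketch of the helper-thread pattern being described, with illustrative names (the actual patch's kthread body and flag handling may differ): the hard-irq path only sets a flag and wakes the kthread, and the kthread makes the sleeping schedule_delayed_work() call from process context.

	static struct task_struct *cmos_helper;
	static volatile bool cmos_delay_pending;

	static int cmos_helper_thread(void *unused)
	{
		while (!kthread_should_stop()) {
			set_current_state(TASK_INTERRUPTIBLE);
			if (cmos_delay_pending) {
				cmos_delay_pending = false;
				__set_current_state(TASK_RUNNING);
				/* safe here: we are in process context */
				schedule_delayed_work(&sync_cmos_work, 0);
				continue;
			}
			schedule();
		}
		return 0;
	}

	void notify_cmos_timer(void)		/* hard irq context */
	{
		cmos_delay_pending = true;
		wake_up_process(cmos_helper);	/* raw wakeup, irq-safe */
	}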
2015-05-20  cgroups: use simple wait in css_release()  (Sebastian Andrzej Siewior)
To avoid:

|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914
|in_atomic(): 1, irqs_disabled(): 0, pid: 92, name: rcuc/11
|2 locks held by rcuc/11/92:
| #0: (rcu_callback){......}, at: [<ffffffff810e037e>] rcu_cpu_kthread+0x3de/0x940
| #1: (rcu_read_lock_sched){......}, at: [<ffffffff81328390>] percpu_ref_call_confirm_rcu+0x0/0xd0
|Preemption disabled at:[<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0
|CPU: 11 PID: 92 Comm: rcuc/11 Not tainted 3.18.7-rt0+ #1
| ffff8802398cdf80 ffff880235f0bc28 ffffffff815b3a12 0000000000000000
| 0000000000000000 ffff880235f0bc48 ffffffff8109aa16 0000000000000000
| ffff8802398cdf80 ffff880235f0bc78 ffffffff815b8dd4 000000000000df80
|Call Trace:
| [<ffffffff815b3a12>] dump_stack+0x4f/0x7c
| [<ffffffff8109aa16>] __might_sleep+0x116/0x190
| [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60
| [<ffffffff8108d2cd>] queue_work_on+0x6d/0x1d0
| [<ffffffff8110c881>] css_release+0x81/0x90
| [<ffffffff8132844e>] percpu_ref_call_confirm_rcu+0xbe/0xd0
| [<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0
| [<ffffffff810e03e5>] rcu_cpu_kthread+0x445/0x940
| [<ffffffff81098a2d>] smpboot_thread_fn+0x18d/0x2d0
| [<ffffffff810948d8>] kthread+0xe8/0x100
| [<ffffffff815b9c3c>] ret_from_fork+0x7c/0xb0

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  completion: Use simple wait queues  (Thomas Gleixner)
Completions have no long-lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues, which reduces the contention on the waitqueue lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  rcu-more-swait-conversions.patch  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merged Steven's:

 static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 {
-	swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
+	wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]);
 }

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  kernel/treercu: use a simple waitqueue  (Sebastian Andrzej Siewior)
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  work-simple: Simple work queue implementation  (Daniel Wagner)
Provides a PREEMPT_RT_FULL-safe framework for enqueuing callbacks from irq context. The callbacks are executed in kthread context. Based on wait-simple.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  simple-wait: rename and export the equivalent of waitqueue_active()  (Paul Gortmaker)
The function swait_head_has_waiters() was internalized into wait-simple.c, but it parallels waitqueue_active() of normal waitqueue support. Given that there are over 150 waitqueue_active() users in drivers/, fs/, kernel/ and the like, let's make it globally visible and rename it to parallel waitqueue_active() accordingly. We'll need to do this if we expect to expand its usage beyond RT.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  wait-simple: Rework for use with completions  (Thomas Gleixner)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  wait-simple: Simple waitqueue implementation  (Thomas Gleixner)
wait_queue is a Swiss army knife, and in most cases the complexity is not needed. For RT, waitqueues are a constant source of trouble, as we can't convert the head lock to a raw spinlock due to fancy and long-lasting callbacks.

Provide a slim version which allows RT to replace wait queues. This should go mainline as well, as it lowers memory consumption and runtime overhead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

smp_mb() added by Steven Rostedt to fix a race condition with swait wakeups vs adding items to the list.
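A hedged sketch of the slim data structures such an implementation implies (names modelled on the RT wait-simple code; details may differ): a raw-spinlock-protected list of waiters with no callback indirection, so the head lock can safely stay raw on RT.

	struct swaiter {
		struct task_struct	*task;	/* who to wake */
		struct list_head	node;	/* entry in head->list */
	};

	struct swait_head {
		raw_spinlock_t		lock;	/* raw: safe on RT */
		struct list_head	list;	/* FIFO of swaiters */
	};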
2015-05-20  sched: Add support for lazy preemption  (Thomas Gleixner)
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks.

One major issue is the wakeup of tasks which right away preempt the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks.

Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption, not about the purely fairness-driven SCHED_OTHER preemption latencies.

So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect, as for laziness reasons I coupled this to migrate_disable/enable, so some other mechanisms get the same treatment (e.g. get_cpu_light).

Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption, and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups, as the other side can stay in the adaptive spinning loop.

For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter.

Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :)

The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via:

   # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features

and re-enabled via:

   # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features

The test results so far are very machine- and workload-dependent, but there is a clear trend that it enhances the non-RT workload performance.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  rcu: Eliminate softirq processing from rcutree  (Paul E. McKenney)
Running RCU out of softirq is a problem for some workloads that would like to manage RCU core processing independently of other softirq work, for example, setting kthread priority. This commit therefore moves the RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread named rcuc. The SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  rt, nohz_full: fix nohz_full for PREEMPT_RT_FULL  (Mike Galbraith)
A task being ticked and trying to shut the tick down will fail due to having just awakened ksoftirqd; subtract it from nr_running.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  softirq: make migrate disable/enable conditioned on softirq_nestcnt transition  (Nicholas Mc Guire)
This patch removes the recursive calls to migrate_disable/enable in local_bh_disable/enable. The softirq-local-lock.patch introduces local_bh_disable/enable which decrements/increments the current->softirq_nestcnt and disables/enables migration as well. As softirq_nestcnt (include/linux/sched.h, conditioned on CONFIG_PREEMPT_RT_BASE) already tracks the nesting level of the recursive calls to local_bh_disable/enable (all in kernel/softirq.c), there is no need to do it twice.

migrate_disable/enable can thus be conditioned on a softirq_nestcnt transition: 0-1 disables migration and 1-0 re-enables it.

No change of functional behavior; this does noticeably reduce the observed nesting level of migrate_disable/enable.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  softirq: Adapt NOHZ softirq pending check to new RT scheme  (Thomas Gleixner)
We can't rely on ksoftirqd anymore. We need to check the tasks which run a particular softirq, and if such a task is PI-blocked, ignore the other pending bits of that task as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  API cleanup - use local_lock not __local_lock for soft  (Nicholas Mc Guire)
Trivial API cleanup - kernel/softirq.c was mimicking local_lock. No change of functional behavior.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2015-05-20  softirq: Split softirq locks  (Thomas Gleixner)
The 3.x RT series removed the split softirq implementation in favour of pushing softirq processing into the context of the thread which raised it. Though this prevents us from handling the various softirqs at different priorities.

Now, instead of reintroducing the split softirq threads, we split the locks which serialize the softirq processing.

If a softirq is raised in the context of a thread, then the softirq is noted in a per-thread field, if the thread is in a bh-disabled region. If the softirq is raised from hard interrupt context, then the bit is set in the flag field of ksoftirqd and ksoftirqd is invoked.

When a thread leaves a bh-disabled region, it tries to execute the softirqs which have been raised in its own context. It acquires the per-softirq/per-cpu lock for the softirq and then checks whether the softirq is still pending in the per-cpu local_softirq_pending() field. If yes, it runs the softirq. If no, then some other task executed it already.

This allows for zero-config softirq elevation in the context of user space tasks or interrupt threads.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  softirq: Split handling function  (Thomas Gleixner)
Split out the inner handling function, so RT can reuse it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  softirq: Make serving softirqs a task flag  (Thomas Gleixner)
Avoid the percpu softirq_runner pointer magic by using a task flag. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-20  perf: Make swevent hrtimer run in irq instead of softirq  (Yong Zhang)
Otherwise we get a deadlock like below:

[ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003
[ 1044.042752] INFO: lockdep is turned off.
[ 1044.042754] Modules linked in:
[ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G W 3.4.0-rc2-rt3-23676-ga723175-dirty #29
[ 1044.042759] Call Trace:
[ 1044.042761] <IRQ> [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80
[ 1044.042770] [<ffffffff8168978c>] __schedule+0x83c/0xa70
[ 1044.042775] [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0
[ 1044.042779] [<ffffffff81689a5e>] schedule+0x2e/0xa0
[ 1044.042782] [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0
[ 1044.042786] [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40
[ 1044.042790] [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40
[ 1044.042794] [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50
[ 1044.042798] [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40
[ 1044.042802] [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10
[ 1044.042805] [<ffffffff8111c568>] event_sched_out+0x118/0x1d0
[ 1044.042809] [<ffffffff8111c649>] group_sched_out+0x29/0x90
[ 1044.042813] [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200
[ 1044.042817] [<ffffffff8111c343>] remote_function+0x63/0x70
[ 1044.042821] [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120
[ 1044.042826] [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40
[ 1044.042831] [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80
[ 1044.042833] <EOI> [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042840] [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70
[ 1044.042844] [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70
[ 1044.042848] [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200
[ 1044.042853] [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20
[ 1044.042857] [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0
[ 1044.042862] [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200
[ 1044.042865] [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210
[ 1044.042869] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042873] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200
[ 1044.042877] [<ffffffff8106b596>] kthread+0xb6/0xc0
[ 1044.042881] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042886] [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10
[ 1044.042889] [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110
[ 1044.042894] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70
[ 1044.042897] [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe
[ 1044.042900] [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0
[ 1044.042902] [<ffffffff8168d990>] ? gs_change+0xb/0xb

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-20  rt: rwsem/rwlock: lockdep annotations  (Thomas Gleixner)
rwlocks and rwsems on RT do not allow multiple readers. Annotate the lockdep acquire functions accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org