aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx.c
AgeCommit message (Collapse)Author
2012-08-01KVM: VMX: Fix ds/es corruption on i386 with preemptionAvi Kivity
Commit b2da15ac26a0c ("KVM: VMX: Optimize %ds, %es reload") broke i386 in the following scenario: vcpu_load ... vmx_save_host_state vmx_vcpu_run (ds.rpl, es.rpl cleared by hardware) interrupt push ds, es # pushes bad ds, es schedule vmx_vcpu_put vmx_load_host_state reload ds, es (with __USER_DS) pop ds, es # of other thread's stack iret # other thread runs interrupt push ds, es schedule # back in vcpu thread pop ds, es # now with rpl=0 iret ... vcpu_put resume_userspace iret # clears ds, es due to mismatched rpl (instead of resume_userspace, we might return with SYSEXIT and then take an exception; when the exception IRETs we end up with cleared ds, es) Fix by avoiding the optimization on i386 and reloading ds, es on the lightweight exit path. Reported-by: Chris Clayron <chris2553@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-12KVM: VMX: Implement PCID/INVPCID for guests with EPTMao, Junjie
This patch handles PCID/INVPCID for guests. Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces so that the processor may retain cached information when software switches to a different linear address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual Volume 3A for details. For guests with EPT, the PCID feature is enabled and INVPCID behaves as running natively. For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD. Signed-off-by: Junjie Mao <junjie.mao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11KVM: VMX: export PFEC.P bit on eptXiao Guangrong
Export the present bit of page fault error code, the later patch will use it Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Emulate invalid guest state by defaultAvi Kivity
Our emulation should be complete enough that we can emulate guests while they are in big real mode, or in a mode transition that is not virtualizable without unrestricted guest support. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Improve error reporting during invalid guest state emulationAvi Kivity
If instruction emulation fails, report it properly to userspace. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Stop invalid guest state emulation on pending eventAvi Kivity
Process the event, possibly injecting an interrupt, before continuing. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Continue emulating after batch exhaustedAvi Kivity
If we return early from an invalid guest state emulation loop, make sure we return to it later if the guest state is still invalid. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Fix interrupt exit condition during emulationAvi Kivity
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow. If that is the case, the main loop will notice that and not inject the interrupt, causing an endless loop. Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt instead. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Limit iterations with emulator_invalid_guest_stateAvi Kivity
Otherwise, if the guest ends up looping, we never exit the srcu critical section, which causes synchronize_srcu() to hang. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Relax check on unusable segmentAvi Kivity
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment descriptors, causing us not to recognize them as unusable segments with emulate_invalid_guest_state=1. Relax the check by testing for segment not present (a non-present segment cannot be usable). Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09KVM: VMX: Return correct CPL during transition to protected modeAvi Kivity
In protected mode, the CPL is defined as the lower two bits of CS, as set by the last far jump. But during the transition to protected mode, there is no last far jump, so we need to return zero (the inherited real mode CPL). Fix by reading CPL from the cache during the transition. This isn't 100% correct since we don't set the CPL cache on a far jump, but since protected mode transition will always jump to a segment with RPL=0, it will always work. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-03KVM: VMX: code clean for vmx_init()Guo Chao
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-06-06KVM: Cleanup the kvm_print functions and introduce pr_XX wrappersChristoffer Dall
Introduces a couple of print functions, which are essentially wrappers around standard printk functions, with a KVM: prefix. Functions introduced or modified are: - kvm_err(fmt, ...) - kvm_info(fmt, ...) - kvm_debug(fmt, ...) - kvm_pr_unimpl(fmt, ...) - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...) Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05KVM: VMX: Fix KVM_SET_SREGS with big real mode segmentsOrit Wasserman
For example migration between Westmere and Nehelem hosts, caught in big real mode. The code that fixes the segments for real mode guest was moved from enter_rmode to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment. Signed-off-by: Orit Wasserman <owasserm@rehdat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05KVM: VMX: Use EPT Access bit in response to memory notifiersXudong Hao
Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTPXudong Hao
In EPT page structure entry, Enable EPT A/D bits if processor supported. Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05KVM: VMX: Add parameter to control A/D bits support, default is onXudong Hao
Add kernel parameter to control A/D bits support, it's on by default. Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-16KVM: VMX: Optimize %ds, %es reloadAvi Kivity
On x86_64, we can defer %ds and %es reload to the heavyweight context switch, since nothing in the lightweight paths uses the host %ds or %es (they are ignored by the processor). Furthermore we can avoid the load if the segments are null, by letting the hardware load the null segments for us. This is the expected case. On i386, we could avoid the reload entirely, since the entry.S paths take care of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS. So we set them to the same values as well. Saves about 70 cycles out of 1600 (around 4%; noisy measurements). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16KVM: VMX: Fix %ds/%es clobberAvi Kivity
The vmx exit code unconditionally restores %ds and %es to __USER_DS. This can override the user's values, since %ds and %es are not saved and restored in x86_64 syscalls. In practice, this isn't dangerous since nobody uses segment registers in long mode, least of all programs that use KVM. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-14KVM: VMX: unlike vmcs on fail pathXiao Guangrong
fix: [ 1529.577273] Call Trace: [ 1529.577289] [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm] [ 1529.577302] [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm] [ 1529.577311] [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm] [ 1529.577315] [<ffffffff81096ba8>] on_each_cpu+0x44/0x84 [ 1529.577326] [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm] [ 1529.577335] [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm] [ 1529.577349] [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm] [ 1529.577358] [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-19Merge branch 'linus' into queueMarcelo Tosatti
Merge reason: development work has dependency on kvm patches merged upstream. Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18KVM: VMX: Fix kvm_set_shared_msr() called in preemptible contextAvi Kivity
kvm_set_shared_msr() may not be called in preemptible context, but vmx_set_msr() does so: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713 caller is kvm_set_shared_msr+0x32/0xa0 [kvm] Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39 Call Trace: [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100 [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm] [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel] ... Making kvm_set_shared_msr() work in preemptible is cleaner, but it's used in the fast path. Making two variants is overkill, so this patch just disables preemption around the call. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-08KVM: VMX: Auto-load on CPUs with VMXJosh Triplett
Enable x86 feature-based autoloading for the kvm-intel module on CPUs with X86_FEATURE_VMX. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-By: Kay Sievers <kay@vrfy.org> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-05KVM: VMX: vmx_set_cr0 expects kvm->srcu lockedMarcelo Tosatti
vmx_set_cr0 is called from vcpu run context, therefore it expects kvm->srcu to be held (for setting up the real-mode TSS). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-28Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Avi Kivity: "Changes include timekeeping improvements, support for assigning host PCI devices that share interrupt lines, s390 user-controlled guests, a large ppc update, and random fixes." This is with the sign-off's fixed, hopefully next merge window we won't have rebased commits. * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: Convert intx_mask_lock to spin lock KVM: x86: fix kvm_write_tsc() TSC matching thinko x86: kvmclock: abstract save/restore sched_clock_state KVM: nVMX: Fix erroneous exception bitmap check KVM: Ignore the writes to MSR_K7_HWCR(3) KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask KVM: PMU: add proper support for fixed counter 2 KVM: PMU: Fix raw event check KVM: PMU: warn when pin control is set in eventsel msr KVM: VMX: Fix delayed load of shared MSRs KVM: use correct tlbs dirty type in cmpxchg KVM: Allow host IRQ sharing for assigned PCI 2.3 devices KVM: Ensure all vcpus are consistent with in-kernel irqchip settings KVM: x86 emulator: Allow PM/VM86 switch during task switch KVM: SVM: Fix CPL updates KVM: x86 emulator: VM86 segments must have DPL 3 KVM: x86 emulator: Fix task switch privilege checks arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation KVM: mmu_notifier: Flush TLBs before releasing mmu_lock ...
2012-03-08KVM: nVMX: Fix erroneous exception bitmap checkNadav Har'El
The code which checks whether to inject a pagefault to L1 or L2 (in nested VMX) was wrong, incorrect in how it checked the PF_VECTOR bit. Thanks to Dan Carpenter for spotting this. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: Fix delayed load of shared MSRsAvi Kivity
Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs and in the CPU registers, but vmx_set_msr() only updated memory. Prior to 46199f33c2953, this didn't matter, since we called vmx_load_host_state(), which scheduled a vmx_save_host_state(), which re-synchronized the CPU state, but now we don't, so the CPU state will not be synchronized until the next exit to host userspace. This mostly affects nested vmx workloads, which play with these MSRs a lot. Fix by loading the MSR eagerly. Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: x86 emulator: Fix task switch privilege checksKevin Wolf
Currently, all task switches check privileges against the DPL of the TSS. This is only correct for jmp/call to a TSS. If a task gate is used, the DPL of this take gate is used for the check instead. Exceptions, external interrupts and iret shouldn't perform any check. [avi: kill kvm-kmod remnants] Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: VMX: remove yield_on_hltRaghavendra K T
yield_on_hlt was introduced for CPU bandwidth capping. Now it is redundant with CFS hardlimit. yield_on_hlt also complicates the scenario in paravirtual environment, that needs to trap halt. for e.g. paravirtualized ticket spinlocks. Acked-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Allow adjust_tsc_offset to be in host or guest cyclesMarcelo Tosatti
Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08KVM: Infrastructure for software and hardware based TSC rate scalingZachary Amsden
This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-21i387: Split up <asm/i387.h> into exported and internal interfacesLinus Torvalds
While various modules include <asm/i387.h> to get access to things we actually *intend* for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-18i387: move TS_USEDFPU flag from thread_info to task_structLinus Torvalds
This moves the bit that indicates whether a thread has ownership of the FPU from the TS_USEDFPU bit in thread_info->status to a word of its own (called 'has_fpu') in task_struct->thread.has_fpu. This fixes two independent bugs at the same time: - changing 'thread_info->status' from the scheduler causes nasty problems for the other users of that variable, since it is defined to be thread-synchronous (that's what the "TS_" part of the naming was supposed to indicate). So perfectly valid code could (and did) do ti->status |= TS_RESTORE_SIGMASK; and the compiler was free to do that as separate load, or and store instructions. Which can cause problems with preemption, since a task switch could happen in between, and change the TS_USEDFPU bit. The change to TS_USEDFPU would be overwritten by the final store. In practice, this seldom happened, though, because the 'status' field was seldom used more than once, so gcc would generally tend to generate code that used a read-modify-write instruction and thus happened to avoid this problem - RMW instructions are naturally low fat and preemption-safe. - On x86-32, the current_thread_info() pointer would, during interrupts and softirqs, point to a *copy* of the real thread_info, because x86-32 uses %esp to calculate the thread_info address, and thus the separate irq (and softirq) stacks would cause these kinds of odd thread_info copy aliases. This is normally not a problem, since interrupts aren't supposed to look at thread information anyway (what thread is running at interrupt time really isn't very well-defined), but it confused the heck out of irq_fpu_usable() and the code that tried to squirrel away the FPU state. (It also caused untold confusion for us poor kernel developers). It also turns out that using 'task_struct' is actually much more natural for most of the call sites that care about the FPU state, since they tend to work with the task struct for other reasons anyway (ie scheduling). And the FPU data that we are going to save/restore is found there too. Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to the %esp issue. Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-and-tested-by: Raphael Prevost <raphael@buro.asia> Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16i387: don't ever touch TS_USEDFPU directly, use helper functionsLinus Torvalds
This creates three helper functions that do the TS_USEDFPU accesses, and makes everybody that used to do it by hand use those helpers instead. In addition, there's a couple of helper functions for the "change both CR0.TS and TS_USEDFPU at the same time" case, and the places that do that together have been changed to use those. That means that we have fewer random places that open-code this situation. The intent is partly to clarify the code without actually changing any semantics yet (since we clearly still have some hard to reproduce bug in this area), but also to make it much easier to use another approach entirely to caching the CR0.TS bit for software accesses. Right now we use a bit in the thread-info 'status' variable (this patch does not change that), but we might want to make it a full field of its own or even make it a per-cpu variable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-13module_param: make bool parameters really bool (arch)Rusty Russell
module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-12-27KVM: VMX: Intercept RDPMCAvi Kivity
Intercept RDPMC and forward it to the PMU emulation code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: Move cpuid code to new fileAvi Kivity
The cpuid code has grown; put it into a separate file. Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: introduce id_to_memslot functionXiao Guangrong
Introduce id_to_memslot to get memslot by slot id Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: VMX: remove unneeded vmx_load_host_state() calls.Gleb Natapov
vmx_load_host_state() does not handle msrs switching (except MSR_KERNEL_GS_BASE) since commit 26bb0981b3f. Remove call to it where it is no longer make sense. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: nVMX: Fix warning-causing idt-vectoring-info behaviorNadav Har'El
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original nVMX patch 23, titled "Correct handling of interrupt injection". Unfortunately, it is possible (though rare) that at this point there is valid idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2, and when L2 tried to run this interrupt's handler, it got a page fault - so it returns the original interrupt vector in idt_vectoring_info. The problem is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT like we wished to, because the VMX spec guarantees that idt_vectoring_info and exit_reason_external_interrupt can never happen together. This is not just specified in the spec - a KVM L1 actually prints a kernel warning "unexpected, valid vectoring info" if we violate this guarantee, and some users noticed these warnings in L1's logs. In order to better emulate a processor, which would never return the external interrupt and the idt-vectoring-info together, we need to separate the two injection steps: First, complete L1's injection into L2 (i.e., enter L2, injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT. Most of this is already in the code - the only change we need is to remain in L2 (and not exit to L1) in this case. Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that although we do enter L2 first, it will exit immediately after processing its injection, allowing us to promptly inject to L1. Note how we test vmcs12->idt_vectoring_info_field; This isn't really the vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated), but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value of this field. This was explained in patch 25 ("Correct handling of idt vectoring info") of the original nVMX patch series. Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to Abel Gordon for helping me figure out the solution, and to Avi Kivity for helping to improve it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXITNadav Har'El
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit requests that when next entering the guest, we should run it only for as little as possible, and exit again. We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1 to continue running so it can inject an event to it, we unfortunately cannot just pretend to have run L2 for a little while - We must really launch L2, otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2) will be lost. So the existing code runs L2 in this case. But L2 could potentially run for a long time until it exits, and the injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us to request that L2 will be entered, as necessary, but will exit as soon as possible after entry. Our implementation of this request uses smp_send_reschedule() to send a self-IPI, with interrupts disabled. The interrupts remain disabled until the guest is entered, and then, after the entry is complete (often including processing an injection and jumping to the relevant handler), the physical interrupt is noticed and causes an exit. On recent Intel processors, we could have achieved the same goal by using MTF instead of a self-IPI. Another technique worth considering in the future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to slightly improve performance by avoiding the useless interrupt handler which ends up being called when smp_send_reschedule() is used. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17KVM: VMX: Check for automatic switch msr table overflowGleb Natapov
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17KVM: VMX: Add support for guest/host-only profilingGleb Natapov
Support guest/host-only profiling by switch perf msrs on a guest entry if needed. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-11-17KVM: VMX: add support for switching of PERF_GLOBAL_CTRLGleb Natapov
Some cpus have special support for switching PERF_GLOBAL_CTRL msr. Add logic to detect if such support exists and works properly and extend msr switching code to use it if available. Also extend number of generic msr switching entries to 8. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25KVM: Clean up and extend rate-limited outputJan Kiszka
The use of printk_ratelimit is discouraged, replace it with pr*_ratelimited or __ratelimit. While at it, convert remaining guest-triggerable printks to rate-limited variants. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25KVM: x86: Move kvm_trace_exit into atomic vmexit sectionJan Kiszka
This avoids that events causing the vmexit are recorded before the actual exit reason. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25KVM: APIC: avoid instruction emulation for EOI writesKevin Tian
Instruction emulation for EOI writes can be skipped, since sane guest simply uses MOV instead of string operations. This is a nice improvement when guest doesn't support x2apic or hyper-V EOI support. a single VM bandwidth is observed with ~8% bandwidth improvement (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation. Signed-off-by: Kevin Tian <kevin.tian@intel.com> <Based on earlier work from>: Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25KVM: nVMX: Fix nested VMX TSC emulationNadav Har'El
This patch fixes two corner cases in nested (L2) handling of TSC-related issues: 1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to the TSC MSR without an exit, then this should set L1's TSC value itself - not offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code). 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore the vmcs12.TSC_OFFSET. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25KVM: L1 TSC handlingNadav Har'El
KVM assumed in several places that reading the TSC MSR returns the value for L1. This is incorrect, because when L2 is running, the correct TSC read exit emulation is to return L2's value. We therefore add a new x86_ops function, read_l1_tsc, to use in places that specifically need to read the L1 TSC, NOT the TSC of the current level of guest. Note that one change, of one line in kvm_arch_vcpu_load, is made redundant by a different patch sent by Zachary Amsden (and not yet applied): kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of course we didn't have to change the call of kvm_get_msr() to read_l1_tsc(). [avi: moved callback to kvm_x86_ops tsc block] Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Acked-by: Zachary Amsdem <zamsden@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-25KVM: VMX: trivial: use BUG_ONJulia Lawall
Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); @@ identifier x; @@ -if (!x) BUG(); +BUG_ON(!x); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>