aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc/platforms/powernv
AgeCommit message (Collapse)Author
2015-10-21powerpc/powernv: Handle irq_happened flag correctly in off-line loopPaul Mackerras
This fixes a bug where it is possible for an off-line CPU to fail to go into a low-power state (nap/sleep/winkle), and to become unresponsive to requests from the KVM subsystem to wake up and run a VCPU. What can happen is that a maskable interrupt of some kind (external, decrementer, hypervisor doorbell, or HMI) after we have called local_irq_disable() at the beginning of pnv_smp_cpu_kill_self() and before interrupts are hard-disabled inside power7_nap/sleep/winkle(). In this situation, the pending event is marked in the irq_happened flag in the PACA. This pending event prevents power7_nap/sleep/winkle from going to the requested low-power state; instead they return immediately. We don't deal with any of these pending event flags in the off-line loop in pnv_smp_cpu_kill_self() because power7_nap et al. return 0 in this case, so we will have srr1 == 0, and none of the processing to clear interrupts or doorbells will be done. Usually, the most obvious symptom of this is that a KVM guest will fail with a console message saying "KVM: couldn't grab cpu N". This fixes the problem by making sure we handle the irq_happened flags properly. First, we hard-disable before the off-line loop. Once we have hard-disabled, the irq_happened flags can't change underneath us. We unconditionally clear the DEC and HMI flags: there is no processing of timer interrupts while off-line, and the necessary HMI processing is all done in lower-level code. We leave the EE and DBELL flags alone for the first iteration of the loop, so that we won't fail to respond to a split-core request that came in just before hard-disabling. Within the loop, we handle external interrupts if the EE bit is set in irq_happened as well as if the low-power state was interrupted by an external interrupt. (We don't need to do the msgclr for a pending doorbell in irq_happened, because doorbells are edge-triggered and don't remain pending in hardware.) Then we clear both the EE and DBELL flags, and once clear, they cannot be set again (until this CPU comes online again, that is). This also fixes the debug check to not be done when we just ran a KVM guest or when the sleep didn't happen because of a pending event in irq_happened. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-09powerpc/powernv: Panic on unhandled Machine CheckDaniel Axtens
All unrecovered machine check errors on PowerNV should cause an immediate panic. There are 2 reasons that this is the right policy: it's not safe to continue, and we're already trying to reboot. Firstly, if we go through the recovery process and do not successfully recover, we can't be sure about the state of the machine, and it is not safe to recover and proceed. Linux knows about the following sources of Machine Check Errors: - Uncorrectable Errors (UE) - Effective - Real Address Translation (ERAT) - Segment Lookaside Buffer (SLB) - Translation Lookaside Buffer (TLB) - Unknown/Unrecognised In the SLB, TLB and ERAT cases, we can further categorise these as parity errors, multihit errors or unknown/unrecognised. We can handle SLB errors by flushing and reloading the SLB. We can handle TLB and ERAT multihit errors by flushing the TLB. (It appears we may not handle TLB and ERAT parity errors: I will investigate further and send a followup patch if appropriate.) This leaves us with uncorrectable errors. Uncorrectable errors are usually the result of ECC memory detecting an error that it cannot correct, but they also crop up in the context of PCI cards failing during DMA writes, and during CAPI error events. There are several types of UE, and there are 3 places a UE can occur: Skiboot, the kernel, and userspace. For Skiboot errors, we have the facility to make some recoverable. For userspace, we can simply kill (SIGBUS) the affected process. We have no meaningful way to deal with UEs in kernel space or in unrecoverable sections of Skiboot. Currently, these unrecovered UEs fall through to machine_check_expection() in traps.c, which calls die(), which OOPSes and sends SIGBUS to the process. This sometimes allows us to stumble onwards. For example we've seen UEs kill the kernel eehd and khugepaged. However, the process killed could have held a lock, or it could have been a more important process, etc: we can no longer make any assertions about the state of the machine. Similarly if we see a UE in skiboot (and again we've seen this happen), we're not in a position where we can make any assertions about the state of the machine. Likewise, for unknown or unrecognised errors, we're not able to say anything about the state of the machine. Therefore, if we have an unrecovered MCE, the most appropriate thing to do is to panic. The second reason is that since e784b6499d9c ("powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine check errors."), we attempt a special OPAL reboot on an unhandled MCE. This is so the hardware can record error data for later debugging. The comments in that commit assert that we are heading down the panic path anyway. At the moment this is not always true. With UEs in kernel space, for instance, they are marked as recoverable by the hardware, so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't panic() but would simply die() and OOPS. It doesn't make sense to be staggering on if we've just tried to reboot: we should panic(). Explicitly panic() on unrecovered MCEs on PowerNV. Update the comments appropriately. This fixes some hangs following EEH events on cxlflash setups. Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-10powerpc/MSI: Fix race condition in tearing down MSI interruptsPaul Mackerras
This fixes a race which can result in the same virtual IRQ number being assigned to two different MSI interrupts. The most visible consequence of that is usually a warning and stack trace from the sysfs code about an attempt to create a duplicate entry in sysfs. The race happens when one CPU (say CPU 0) is disposing of an MSI while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls (for example) pnv_teardown_msi_irqs(), which calls msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its hardware IRQ number) is no longer in use. Then, before CPU 0 gets to calling irq_dispose_mapping() to free up the virtal IRQ number, CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an MSI, and gets the same hardware IRQ number that CPU 0 just freed. CPU 1 then calls irq_create_mapping() to get a virtual IRQ number, which sees that there is currently a mapping for that hardware IRQ number and returns the corresponding virtual IRQ number (which is the same virtual IRQ number that CPU 0 was using). CPU 0 then calls irq_dispose_mapping() and frees that virtual IRQ number. Now, if another CPU comes along and calls irq_create_mapping(), it is likely to get the virtual IRQ number that was just freed, resulting in the same virtual IRQ number apparently being used for two different hardware interrupts. To fix this race, we just move the call to msi_bitmap_free_hwirqs() to after the call to irq_dispose_mapping(). Since virq_to_hw() doesn't work for the virtual IRQ number after irq_dispose_mapping() has been called, we need to call it before irq_dispose_mapping() and remember the result for the msi_bitmap_free_hwirqs() call. The pattern of calling msi_bitmap_free_hwirqs() before irq_dispose_mapping() appears in 5 places under arch/powerpc, and appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") from 2007. Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") Cc: stable@vger.kernel.org # v2.6.22+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-07powerpc/powernv/pci-ioda: fix kdump with non-power-of-2 crashkernel=Nishanth Aravamudan
The 32-bit TCE table initialization relies on the DMA window having a size equal to a power of 2 (and checks for it explicitly). But crashkernel= has no constraint that requires a power-of-2 be specified. This causes the kdump kernel to fail to boot as none of the PCI devices (including the disk controller) are successfully initialized. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 4.2 Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-07powerpc/powernv/pci-ioda: fix 32-bit TCE table init in kdump kernelNishanth Aravamudan
When attempting to kdump with the 4.2 kernel, we see for each PCI device: pci 0003:01 : [PE# 000] Assign DMA32 space pci 0003:01 : [PE# 000] Setting up 32-bit TCE table at 0..80000000 pci 0003:01 : [PE# 000] Failed to create 32-bit TCE table, err -22 PCI: Domain 0004 has 8 available 32-bit DMA segments PCI: 4 PE# for a total weight of 70 pci 0004:01 : [PE# 002] Assign DMA32 space pci 0004:01 : [PE# 002] Setting up 32-bit TCE table at 0..80000000 pci 0004:01 : [PE# 002] Failed to create 32-bit TCE table, err -22 pci 0004:0d : [PE# 005] Assign DMA32 space pci 0004:0d : [PE# 005] Setting up 32-bit TCE table at 0..80000000 pci 0004:0d : [PE# 005] Failed to create 32-bit TCE table, err -22 pci 0004:0e : [PE# 006] Assign DMA32 space pci 0004:0e : [PE# 006] Setting up 32-bit TCE table at 0..80000000 pci 0004:0e : [PE# 006] Failed to create 32-bit TCE table, err -22 pci 0004:10 : [PE# 008] Assign DMA32 space pci 0004:10 : [PE# 008] Setting up 32-bit TCE table at 0..80000000 pci 0004:10 : [PE# 008] Failed to create 32-bit TCE table, err -22 and eventually the kdump kernel fails to boot as none of the PCI devices (including the disk controller) are successfully initialized. The EINVAL response is because the DMA window (the 2GB base window) is larger than the kdump kernel's reserved memory (crashkernel=, in this case specified to be 1024M). The check in question, if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size)) is a valid sanity check for pnv_pci_ioda2_table_alloc_pages(), so adjust the caller to pass in a smaller window size if our maximum memory value is smaller than the DMA window. After this change, the PCI devices successfully set up the 32-bit TCE table and kdump succeeds. The problem was seen on a Firestone machine originally. Fixes: aca6913f5551 ("powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages") Cc: stable@vger.kernel.org # 4.2 Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Coding style pedantry, use u64, change the indentation] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-09-03Merge tag 'powerpc-4.3-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - support "hybrid" iommu/direct DMA ops for coherent_mask < dma_mask from Benjamin Herrenschmidt - EEH fixes for SRIOV from Gavin - introduce rtas_get_sensor_fast() for IRQ handlers from Thomas Huth - use hardware RNG for arch_get_random_seed_* not arch_get_random_* from Paul Mackerras - seccomp filter support from Michael Ellerman - opal_cec_reboot2() handling for HMIs & machine checks from Mahesh Salgaonkar - add powerpc timebase as a trace clock source from Naveen N. Rao - misc cleanups in the xmon, signal & SLB code from Anshuman Khandual - add an inline function to update POWER8 HID0 from Gautham R. Shenoy - fix pte_pagesize_index() crash on 4K w/64K hash from Michael Ellerman - drop support for 64K local store on 4K kernels from Michael Ellerman - move dma_get_required_mask() from pnv_phb to pci_controller_ops from Andrew Donnellan - initialize distance lookup table from drconf path from Nikunj A Dadhania - enable RTC class support from Vaibhav Jain - disable automatically blocked PCI config from Gavin Shan - add LEDs driver for PowerNV platform from Vasant Hegde - fix endianness issues in the HVSI driver from Laurent Dufour - kexec endian fixes from Samuel Mendoza-Jonas - fix corrupted pdn list from Gavin Shan - fix fenced PHB caused by eeh_slot_error_detail() from Gavin Shan - Freescale updates from Scott: Highlights include 32-bit memcpy/memset optimizations, checksum optimizations, 85xx config fragments and updates, device tree updates, e6500 fixes for non-SMP, and misc cleanup and minor fixes. - a ton of cxl updates & fixes: - add explicit precision specifiers from Rasmus Villemoes - use more common format specifier from Rasmus Villemoes - destroy cxl_adapter_idr on module_exit from Johannes Thumshirn - destroy afu->contexts_idr on release of an afu from Johannes Thumshirn - compile with -Werror from Daniel Axtens - EEH support from Daniel Axtens - plug irq_bitmap getting leaked in cxl_context from Vaibhav Jain - add alternate MMIO error handling from Ian Munsie - allow release of contexts which have been OPENED but not STARTED from Andrew Donnellan - remove use of macro DEFINE_PCI_DEVICE_TABLE from Vaishali Thakkar - release irqs if memory allocation fails from Vaibhav Jain - remove racy attempt to force EEH invocation in reset from Daniel Axtens - fix + cleanup error paths in cxl_dev_context_init from Ian Munsie - fix force unmapping mmaps of contexts allocated through the kernel api from Ian Munsie - set up and enable PSL Timebase from Philippe Bergheaud * tag 'powerpc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (140 commits) cxl: Set up and enable PSL Timebase cxl: Fix force unmapping mmaps of contexts allocated through the kernel api cxl: Fix + cleanup error paths in cxl_dev_context_init powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail() powerpc/pseries: Cleanup on pci_dn_reconfig_notifier() powerpc/pseries: Fix corrupted pdn list powerpc/powernv: Enable LEDS support powerpc/iommu: Set default DMA offset in dma_dev_setup cxl: Remove racy attempt to force EEH invocation in reset cxl: Release irqs if memory allocation fails cxl: Remove use of macro DEFINE_PCI_DEVICE_TABLE powerpc/powernv: Fix mis-merge of OPAL support for LEDS driver powerpc/powernv: Reset HILE before kexec_sequence() powerpc/kexec: Reset secondary cpu endianness before kexec powerpc/hvsi: Fix endianness issues in the HVSI driver leds/powernv: Add driver for PowerNV platform powerpc/powernv: Create LED platform device powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED states powerpc/powernv: Fix the log message when disabling VF cxl: Allow release of contexts which have been OPENED but not STARTED ...
2015-09-01Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "This updated pull request does not contain the last few GIC related patches which were reported to cause a regression. There is a fix available, but I let it breed for a couple of days first. The irq departement provides: - new infrastructure to support non PCI based MSI interrupts - a couple of new irq chip drivers - the usual pile of fixlets and updates to irq chip drivers - preparatory changes for removal of the irq argument from interrupt flow handlers - preparatory changes to remove IRQF_VALID" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources irqchip: Add bcm2836 interrupt controller for Raspberry Pi 2 irqchip: Add documentation for the bcm2836 interrupt controller irqchip/bcm2835: Add support for being used as a second level controller irqchip/bcm2835: Refactor handle_IRQ() calls out of MAKE_HWIRQ PCI: xilinx: Fix typo in function name irqchip/gic: Ensure gic_cpu_if_up/down() programs correct GIC instance irqchip/gic: Only allow the primary GIC to set the CPU map PCI/MSI: pci-xgene-msi: Consolidate chained IRQ handler install/remove unicore32/irq: Prepare puv3_gpio_handler for irq argument removal tile/pci_gx: Prepare trio_handle_level_irq for irq argument removal m68k/irq: Prepare irq handlers for irq argument removal C6X/megamode-pic: Prepare megamod_irq_cascade for irq argument removal blackfin: Prepare irq handlers for irq argument removal arc/irq: Prepare idu_cascade_isr for irq argument removal sparc/irq: Use access helper irq_data_get_affinity_mask() sparc/irq: Use helper irq_data_get_irq_handler_data() parisc/irq: Use access helper irq_data_get_affinity_mask() mn10300/irq: Use access helper irq_data_get_affinity_mask() irqchip/i8259: Prepare i8259_irq_dispatch for irq argument removal ...
2015-08-27powerpc/iommu: Set default DMA offset in dma_dev_setupAlexey Kardashevskiy
Commit e91c25111aa3 "powerpc/iommu: Cleanup setting of DMA base/offset" expects that the default DMA offset is set from pnv_ioda_setup_bus_dma() which is correct unless it is SRIOV where the code flow is different - at the moment when pnv_ioda_setup_bus_dma() is called, PCI devices for VFs are not created yet. This adds missing set_dma_offset() to pnv_pci_ioda_dma_dev_setup() to cover the case of SRIOV. Note that we still need set_dma_offset() in pnv_ioda_setup_bus_dma() as at the boot time pnv_pci_ioda_dma_dev_setup() is called when no PE was created yet, this happens at the PHB fixup stage. Fixes: e91c25111aa3 ("powerpc/iommu: Cleanup setting of DMA base/offset") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20powerpc/powernv: Reset HILE before kexec_sequence()Samuel Mendoza-Jonas
On powernv secondary cpus are returned to OPAL, and will then enter the target kernel in big-endian. However if it is set the HILE bit will persist, causing the first exception in the target kernel to be delivered in litte-endian regardless of the current endianness. If running on top of OPAL make sure the HILE bit is reset once we've finished waiting for all of the secondaries to be returned to OPAL. Signed-off-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20powerpc/powernv: Create LED platform deviceVasant Hegde
This patch adds platform devices for leds. Also export LED related OPAL API's so that led driver can use these APIs. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20powerpc/powernv: Add OPAL interfaces for accessing and modifying system LED ↵Anshuman Khandual
states This patch registers the following two new OPAL interfaces calls for the platform LED subsystem. With the help of these new OPAL calls, the kernel will be able to get or set the state of various individual LEDs on the system at any given location code which is passed through the LED specific device tree nodes. (1) OPAL_LEDS_GET_INDICATOR opal_leds_get_ind (2) OPAL_LEDS_SET_INDICATOR opal_leds_set_ind Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Tested-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-20powerpc/powernv: Fix the log message when disabling VFWei Yang
On powernv platform, IOV BAR would be shifted if necessary. While the log message is not correct when disabling VFs. This patch fixes this by print correct message based on the offset value. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-18powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_opsAndrew Donnellan
Simplify the dma_get_required_mask call chain by moving it from pnv_phb to pci_controller_ops, similar to commit 763d2d8df1ee ("powerpc/powernv: Move dma_set_mask from pnv_phb to pci_controller_ops"). Previous call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that points to pnv_dma_get_required_mask() (platforms/powernv/setup.c) 2) device is PCI, therefore call pnv_pci_dma_get_required_mask() (platforms/powernv/pci.c) 3) call phb->dma_get_required_mask if it exists 4) it only exists in the ioda case, where it points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) New call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask if it exists 2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) In the p5ioc2 case, the call chain remains the same - dma_get_required_mask() does not find either a ppc_md call or pci_controller_ops call, so it calls __dma_get_required_mask(). Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-14powerpc: Add an inline function to update POWER8 HID0Gautham R. Shenoy
Section 3.7 of Version 1.2 of the Power8 Processor User's Manual prescribes that updates to HID0 be preceded by a SYNC instruction and followed by an ISYNC instruction (Page 91). Create an inline function name update_power8_hid0() which follows this recipe and invoke it from the static split core path. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Tested-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable HMI.Mahesh Salgaonkar
Invoke new opal_cec_reboot2() call with reboot type OPAL_REBOOT_PLATFORM_ERROR (for unrecoverable HMI interrupts) to inform BMC/OCC about this error, so that BMC can collect relevant data for error analysis and decide what component to de-configure before rebooting. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine check ↵Mahesh Salgaonkar
errors. On non-recoverable MCE errors in kernel space, Linux kernel panics and system reboots. On BMC based system opal-prd runs as a daemon in the host. Hence, kernel crash may prevent opal-prd to detect and analyze this MCE error. This may land us in a situation where the faulty memory never gets de-configured and Linux would keep hitting same MCE error again and again. If this happens in early stage of kernel initialization, then Linux will keep crashing and rebooting in a loop. This patch fixes this issue by invoking new opal_cec_reboot2() call with reboot type OPAL_REBOOT_PLATFORM_ERROR to inform BMC/OCC about this error, so that BMC can collect relevant data for error analysis and decide what component to de-configure before rebooting. This patch is dependent on OPAL patchset posted on skiboot mailing list at https://lists.ozlabs.org/pipermail/skiboot/2015-July/001771.html that introduces opal_cec_reboot2() opal call. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06powerpc/powernv: Pull all HMI events before panic.Mahesh Salgaonkar
In the event of unrecovered HMI the existing code panics as soon as it receives the first unrecovered HMI event. This makes host to report partial information about HMIs before panic. There may be more errors which would have caused the HMI and hence more HMI event would have been generated waiting to be pulled by host. This patch implements a logic to pull and display all the HMI event before going down panic path. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-08-06powerpc/powernv: display reason for Malfunction Alert HMI.Mahesh Salgaonkar
The V2 version of HMI event now carries additional information for Malfunction Alert. It now contains error information about CORE and NX checkstop. This patch checks and displays the check stop reason before panic. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-30powerpc/eeh-powernv: Fix unbalanced IRQ warningAlistair Popple
pnv_eeh_next_error() re-enables the eeh opal event interrupt but it gets called from a loop if there are more outstanding events to process, resulting in a warning due to enabling an already enabled interrupt. Instead the interrupt should only be re-enabled once the last outstanding event has been processed. Tested-by: Daniel Axtens <dja@axtens.net> Reported-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Alistair Popple <alistair@popple.id.au> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-30genirq/irqdomain: Allow irq domain aliasingMarc Zyngier
It is not uncommon (at least with the ARM stuff) to have a piece of hardware that implements different flavours of "interrupts". A typical example of this is the GICv3 ITS, which implements standard PCI/MSI support, but also some form of "generic MSI". So far, the PCI/MSI domain is registered using the ITS device_node, so that irq_find_host can return it. On the contrary, the raw MSI domain is not registered with an device_node, making it impossible to be looked up by another subsystem (obviously, using the same device_node twice would only result in confusion, as it is not defined which one irq_find_host would return). A solution to this is to "type" domains that may be aliasing, and to be able to lookup an device_node that matches a given type. For this, we introduce irq_find_matching_host() as a superset of irq_find_host: struct irq_domain *irq_find_matching_host(struct device_node *node, enum irq_domain_bus_token bus_token); where bus_token is the "type" we want to match the domain against (so far, only DOMAIN_BUS_ANY is defined). This result in some moderately invasive changes on the PPC side (which is the only user of the .match method). This has otherwise no functionnal change. Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: <linux-arm-kernel@lists.infradead.org> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Ma Jun <majun258@huawei.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: Duc Dang <dhdang@apm.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jason Cooper <jason@lakedaemon.net> Link: http://lkml.kernel.org/r/1438091186-10244-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-30Merge branch 'linus' into irq/coreThomas Gleixner
Pull in upstream fixes before applying conflicting changes
2015-07-23powerpc/powernv/ioda2: Fix calculation for memory allocated for TCE tableAlexey Kardashevskiy
The existing code stores the amount of memory allocated for a TCE table. At the moment it uses @offset which is a virtual offset in the TCE table which is only correct for a one level tables and it does not include memory allocated for intermediate levels. When multilevel TCE table is requested, WARN_ON in tce_iommu_create_table() prints a warning. This adds an additional counter to pnv_pci_ioda2_table_do_alloc_pages() to count actually allocated memory. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-23powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_*Paul Mackerras
The hardware RNG on POWER8 and POWER7+ can be relatively slow, since it can only supply one 64-bit value per microsecond. Currently we read it in arch_get_random_long(), but that slows down reading from /dev/urandom since the code in random.c calls arch_get_random_long() for every longword read from /dev/urandom. Since the hardware RNG supplies high-quality entropy on every read, it matches the semantics of arch_get_random_seed_long() better than those of arch_get_random_long(). Therefore this commit makes the code use the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int} and not for arch_get_random_{long,int}. This won't affect any other PowerPC-based platforms because none of them currently support a hardware RNG. To make it clear that the ppc_md function pointer is used for arch_get_random_seed_*, we rename it from get_random_long to get_random_seed. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-22powerpc/PCI: Use for_pci_msi_entry() to access MSI device listJiang Liu
Use accessor for_each_pci_msi_entry() to access MSI device list, so we could easily move msi_list from struct pci_dev into struct device later. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Stuart Yoder <stuart.yoder@freescale.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Olof Johansson <olof@lixom.net> Cc: Gavin Shan <gwshan@linux.vnet.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Daniel Axtens <dja@axtens.net> Cc: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Scott Wood <scottwood@freescale.com> Cc: Laurentiu Tudor <Laurentiu.Tudor@freescale.com> Cc: Tudor Laurentiu <b10716@freescale.com> Cc: Hongtao Jia <hongtao.jia@freescale.com> Link: http://lkml.kernel.org/r/1436428847-8886-4-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-21powerpc/eeh: Dump PHB diag-data for non-existing PEGavin Shan
When detecting EEH error on non-existing PE, including the reserved one, the PE is simply unfrozen without dumping the PHB diag-data, which is useful for locating the root cause of the EEH error. The patch dumps the PHB diag-data when non-existing PE reports error. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-21powerpc/eeh: Fix wrong printed PE numberGavin Shan
On LE kernel, the non-existing PE number in BE format derived from skiboot firmware isn't converted to LE format properly as following kernel log indicates: EEH: Clear non-existing PHB#4-PE#200000000000000 Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-16powerpc/powernv: Add poweroff (EPOW, DPO) events support for PowerNV platformVipin K Parashar
This patch adds support for OPAL EPOW (Environmental and Power Warnings) and DPO (Delayed Power Off) events for the PowerNV platform. These events are generated on FSP (Flexible Service Processor) based systems. EPOW events are generated due to various critical system conditions that require system shutdown. A few examples of these conditions are high ambient temperature or system running on UPS power with low UPS battery. DPO event is generated in response to admin initiated system shutdown request. Upon receipt of EPOW and DPO events the host kernel invokes orderly_poweroff() for performing graceful system shutdown. Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com> Acked-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Unfreeze VF PE on releasing itGavin Shan
When releasing PE for SRIOV VF, the PE is forced to be frozen wrongly. When the same PE is picked for another VF, it won't work anyhow. The patch fixes the issue by unfreezing, not freezing the VF PE when releasing it. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Include VF PE in PELTV of PF PEGavin Shan
The PELTV of PF PE should include VF PE, which is missed by current code, so that the VF PE is frozen automatically when freezing PF PE. The patch fixes the PELTV of PF PE to include VF PE. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Pick M64 PEs based on BARsGavin Shan
On PHB3, PE might be reserved in advance to reflect the M64 segments consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI devices included in the PE. The PE is picked based on M64 BARs instead of the bridge's M64 windows, which might include VF BARs. Otherwise, wrong PE could be picked. The patch calculates the used M64 segments and PE numbers according to the M64 BARs, excluding VF BARs, of PCI devices in one particular PE, instead of the bridge's M64 windows. Then the right PE number is picked. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Boolean argument for pnv_ioda_setup_bus_PE()Gavin Shan
The patch changes the type of last argument of pnv_ioda_setup_bus_PE() and phb::pick_m64_pe() to boolean. No functional change. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Reserve M64 PEs based on BARsGavin Shan
On PHB3, some PEs might be reserved in advance to reflect the M64 segments consumed by those PEs. We're reserving PEs based on the M64 window of root port, which might contain VF BAR. The PEs for VFs are allocated dynamically, not reserved based on the consumed M64 segments. So the M64 window of root port isn't reliable for the task. Instead, we go through M64 BARs (VF BARs excluded) of PCI devices under the specified root bus and reserve PEs accordingly, as the patch does. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/powernv: Allow to reserve one PE for multiple timesGavin Shan
The PE numbers are reserved according to root port's M64 window, which is aligned to M64 segment finely. So one PE shouldn't be reserved for multiple times. We will reserve PE numbers according to the M64 BARs of PCI device in subsequent patches, which aren't aligned to M64 segment size finely. It means one particular PE could be reserved for multiple times. The patch allows one PE to be reserved for multiple times and we print the warning message at debugging level. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-13powerpc/iommu: Cleanup setting of DMA base/offsetBenjamin Herrenschmidt
Now that the table and the offset can co-exist, we no longer need to flip/flop, we can just establish both once at boot time. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-06powerpc/powernv: Fix opal-elog interrupt handlerAlistair Popple
The conversion of opal events to a proper irqchip means that handlers are called until the relevant opal event has been cleared by processing it. Events that queue work should therefore use a threaded handler to mask the event until processing is complete. Signed-off-by: Alistair Popple <alistair@popple.id.au> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-07-06powerpc/powernv: Fix vma page prot flags in opal-prd driverVaidyanathan Srinivasan
opal-prd driver will mmap() firmware code/data area as private mapping to prd user space daemon. Write to this page will trigger COW faults. The new COW pages are normal kernel RAM pages accounted by the kernel and are not special. vma->vm_page_prot value will be used at page fault time for the new COW pages, while pgprot_t value passed in remap_pfn_range() is used for the initial page table entry. Hence: * Do not add _PAGE_SPECIAL in vma, but only for remap_pfn_range() * Also remap_pfn_range() will add the _PAGE_SPECIAL flag using pte_mkspecial() call, hence no need to specify in the driver This fix resolves the page accounting warning shown below: BUG: Bad rss-counter state mm:c0000007d34ac600 idx:1 val:19 The above warning is triggered since _PAGE_SPECIAL was incorrectly being set for the normal kernel COW pages. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Acked-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-19powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()Alexey Kardashevskiy
When pnv_pci_ioda_fixup() is called during PHB fixup time, each PE in the sorted list of PEs (phb::pe_dma_list) is iterated to setup the PE's DMA32 space by pnv_ioda_setup_bus_dma() if the PE's DMA32 weight is bigger than zero. The function also assigns all the subordinate PCI devices of the PE's primary bus with the PE's DMA32 IOMMU table. It causes the PCI devicess in the child PEs, which don't have DMA weight, receives wrong IOMMU table and then IOMMU group. The patch fixes above issue by more check on the PE's coverage and don't assign IOMMU table to those PCI devices, which belong to the child PEs. The problem was found on Firestone platform initially. Suggested-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=offAlexey Kardashevskiy
The pnv_pci_ioda2_unset_window() function is used to do the final cleanup of a DMA window being released: - via VFIO ioctl by the guest request; - via unplugging a virtual PCI function. However the function was under #ifdef CONFIG_IOMMU_API and was missing. This moves the helper outside of IOMMU_API block and enables it for either or both IOMMU_API and PCI_IOV. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18powerpc/powernv: fix construction of opal PRD messagesJeremy Kerr
We currently have a bug in the PRD code, where the contents of an incoming message (beyond the header) will be overwritten by the list item manipulations when adding to to the prd_msg_queue. This change reorders struct opal_prd_msg_queue_item, so that the message body doesn't overlap the list_head. We also clarify the memcpy of the message, as we're copying unnecessary bytes at the end of the message data. Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-18powerpc/powernv: Increase opal-irqchip initcall priorityAlistair Popple
The eeh subsystem for powernv requires the opal event irqchip to be initialised prior to initialisation or the following errors are produced (and eeh doesn't work as expected): irq: XICS didn't like hwirq-0x9 to VIRQ17 mapping (rc=-22) pnv_eeh_post_init: Can't request OPAL event interrupt (0) On powernv eeh is initialised from a subsys_initcall due to a check for machine_is(powernv) in eeh_init(). This patch increases the initcall priority of opal_event_init() to an arch_initcall to ensure the opal event interface is initialised prior to any users of it. Signed-off-by: Alistair Popple <alistair@popple.id.au> Reported-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-15powerpc/powernv: pnv_init_idle_states() should only run on powernvMichael Ellerman
Although this init call checks for device tree properties before doing anything, it should still only run on powernv machines. Reviewed-by: Shreyas B Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership ↵Alexey Kardashevskiy
control Before the IOMMU user (VFIO) would take control over the IOMMU table belonging to a specific IOMMU group. This approach did not allow sharing tables between IOMMU groups attached to the same container. This introduces a new IOMMU ownership flavour when the user can not just control the existing IOMMU table but remove/create tables on demand. If an IOMMU implements take/release_ownership() callbacks, this lets the user have full control over the IOMMU group. When the ownership is taken, the platform code removes all the windows so the caller must create them. Before returning the ownership back to the platform code, VFIO unprograms and removes all the tables it created. This changes IODA2's onwership handler to remove the existing table rather than manipulating with the existing one. From now on, iommu_take_ownership() and iommu_release_ownership() are only called from the vfio_iommu_spapr_tce driver. Old-style ownership is still supported allowing VFIO to run on older P5IOC2 and IODA IO controllers. No change in userspace-visible behaviour is expected. Since it recreates TCE tables on each ownership change, related kernel traces will appear more often. This adds a pnv_pci_ioda2_setup_default_config() which is called when PE is being configured at boot time and when the ownership is passed from VFIO to the platform code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future tableAlexey Kardashevskiy
This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_ioda2_get_table_size() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE releaseAlexey Kardashevskiy
The existing code programmed TVT#0 with some address and then immediately released that memory. This makes use of pnv_pci_ioda2_unset_window() and pnv_pci_ioda2_set_bypass() which do correct resource release and TVT update. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows APIAlexey Kardashevskiy
This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA windows management. create_table() creates a TCE table with specific parameters. it receives iommu_table_group to know nodeid in order to allocate TCE table memory closer to the PHB. The exact format of allocated multi-level table might be also specific to the PHB model (not the case now though). This callback calculated the DMA window offset on a PCI bus from @num and stores it in a just created table. set_window() sets the window at specified TVT index + @num on PHB. unset_window() unsets the window from specified TVT. This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table. create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others. This makes use of new values in vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not advertise pagemasks to the userspace. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv: Implement multilevel TCE tablesAlexey Kardashevskiy
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages() helpers. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_windowAlexey Kardashevskiy
This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Introduce helpers to allocate TCE pagesAlexey Kardashevskiy
This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves the code which allocates the actual TCE tables to helpers: pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages(). These do not allocate/free the iommu_table struct. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Rework iommu_table creationAlexey Kardashevskiy
This moves iommu_table creation to the beginning to make following changes easier to review. This starts using table parameters from the iommu_table struct. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/iommu/powernv: Release replaced TCEAlexey Kardashevskiy
At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives linear mapping address; and returns a physical address as the clear() does. This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both VFIO ioctl interface in in-kernel TCE acceleration (when it becomes available later). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>