aboutsummaryrefslogtreecommitdiff
path: root/docs/n1sdp/cmn-performance-analysis.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/n1sdp/cmn-performance-analysis.rst')
-rw-r--r--docs/n1sdp/cmn-performance-analysis.rst363
1 files changed, 0 insertions, 363 deletions
diff --git a/docs/n1sdp/cmn-performance-analysis.rst b/docs/n1sdp/cmn-performance-analysis.rst
deleted file mode 100644
index 96c4740..0000000
--- a/docs/n1sdp/cmn-performance-analysis.rst
+++ /dev/null
@@ -1,363 +0,0 @@
-CMN-600 perf example on Neoverse N1 SDP
-=======================================
-The goal of this document is to give a short introduction on CMN-600
-performance analysis on N1SDP.
-This includes driver load verification and Linux perf usage examples.
-
-The examples include examples such as system level cache access and traffic to
-and from PCI-E devices from the view of the interconnect.
-
-.. section-numbering::
- :suffix: .
-
-.. contents::
-
-
-Support in Arm's Neoverse N1 SDP software release
--------------------------------------------------
-The software support for CMN-600 performance analysis can be divided into three
-components:
-
-* The user space Linux perf tool
-* The Linux kernel arm-cmn driver
-* EDK2 (DSDT table entry)
-
-The default build of the supplied Neoverse software stack will include all
-necessary changes and patches to test and explore CMN-600 performance analysis.
-
-
-CMN-600 Topology and NodeIDs on Neoverse N1 SDP
------------------------------------------------
-The PMUs in CMN-600 are distributed to the nodes of the mesh interconnect.
-NodeType specific events are configured per node.
-Event counting is done by local counters in the XP attached to the node.
-Global counters are in the Debug Trace Controller (DTC).
-The arm-cmn driver uses local/global register pairing to provide 64-bit event
-counters (see "Counter Allocation" section below).
-
-All the nodes are referenced by NodeID and NodeType.
-PMU events must specify the NodeID of the node on which it is to be counted
-using the nodeid= parameter.
-A summary of Node ID can be found the table below.
-For more details contact support (support-subsystem-enterprise@arm.com).
-
-+---------------------------------+-----------+---------+--------------------+
-| Purpose | Node Type | Node ID | Event Name |
-+---------------------------------+-----------+---------+--------------------+
-| System-Level Cache slices (SLC) | HNF | 0x24 | arm_cmn/hnf |
-| | | 0x28 | |
-| | | 0x44 | |
-| | | 0x48 | |
-+---------------------------------+-----------+---------+--------------------+
-| PCI_CCIX (Expansion slot 4) | RND | 0x08 | arm_cmn/rnid |
-+---------------------------------+-----------+---------+--------------------+
-| PCI_0 (All other PCI-E) | RND | 0x0c | arm_cmn/rnid |
-+---------------------------------+-----------+---------+--------------------+
-| Mesh interconnections | XP | 0x00 | arm_cmn/mxp |
-| | | 0x08 | |
-| | | 0x20 | |
-| | | 0x28 | |
-| | | 0x40 | |
-| | | 0x48 | |
-| | | 0x60 | |
-| | | 0x68 | |
-+---------------------------------+-----------+---------+--------------------+
-| Debug Trace Controller | DTC | 0x68 | arm_cmn/dtc_cycles |
-+---------------------------------+-----------+---------+--------------------+
-| ACE-lite slave | SBSX | 0x64 | arm_cmn/sbsx |
-+---------------------------------+-----------+---------+--------------------+
-
-For details on what is connected to PCI_0 check the N1SDP TRM (Figure 2-9
-PCI Express and CCIX system).
-
-Software components
--------------------
-
-Linux perf tool
-###############
-No modifications of ``perf`` source is needed.
-The user can opt to use any perf compatible with the built kernel or use the
-included script ``build-scripts/build-perf.sh`` to build a static linked binary
-from the included kernel source (binary is created as
-``output/n1sdp/build_artifact/perf``).
-
-
-ACPI DSDT modification
-######################
-The Linux driver expects a DSDT entry that describe the location of the CMN-600
-configuration space.
-This is included in the supplied Neoverse software stack.
-
-Linux perf driver (arm-cmn)
-###########################
-The included arm-cmn driver is a work-in-progress.
-A Snapshot of this driver is included in the supplied Neoverse software stack.
-The driver is controller by ``CONFIG_ARM_CMN`` (enabled in default software
-stack build).
-
-Counter Allocation/Limitation
-*****************************
-The arm-cmn driver provides 64-bit event counts for any given event.
-It accomplishes this using a combination of combined-pair local counters (in an
-DTM/XP) and (uncombined) global counters (in the DTC):
-
-* DTM/XP
- Can provide up to two 32-bit local counters (built from paired 16-bit
- counters por_dtm_pmevcnt0+1, and 2+3) for events from itself and/or the
- up-to-2 devices that are connected to its ports.
-
- Overflows from these counters are sent to its DTC's global counters.
- This means only up to 2 events from any of the devices connected to an XP
- can be counted at the same time without sampling.
-
-
-* DTC
- Each DTC can provide up to 8 global counters (por_dt_pmevcntA .. H).
- This means only up to 8 events in a DTC domain can be counted at the same
- time without sampling.
-
-For example, the N1SDP's two PCI-Express root complexes' RND (PCI_CCIX on RND3
-at NodeID 0x8 and PCI0 on RND4 at NodeID 0xC), hang off of the same XP (0,1).
-Only up to 2 RND events from either of the two PCI-E domains can be measured
-simultaneously without sampling; 3 or more will require sampling.
-
-In the following example, we try to measure 4 RND events, but perf is only
-giving 50% sampling time for each count because the events have to share local
-counters in the XP.
-::
-
- $ perf stat -a \
- -e arm_cmn/rnid_txdat_flits,nodeid=8/ \
- -e arm_cmn/rnid_txdat_flits,nodeid=12/ \
- -e arm_cmn/rnid_rxdat_flits,nodeid=8/ \
- -e arm_cmn/rnid_rxdat_flits,nodeid=12/ \
- -I 1000
- # time counts unit events
- 1.000089438 0 arm_cmn/rnid_txdat_flits,nodeid=8/ (50.00%)
- 1.000089438 0 arm_cmn/rnid_txdat_flits,nodeid=12/ (50.00%)
- 1.000089438 0 arm_cmn/rnid_rxdat_flits,nodeid=8/ (50.00%)
- 1.000089438 0 arm_cmn/rnid_rxdat_flits,nodeid=12/ (50.00%)
- 2.000231897 79 arm_cmn/rnid_txdat_flits,nodeid=8/ (50.01%)
- 2.000231897 0 arm_cmn/rnid_txdat_flits,nodeid=12/ (50.01%)
- 2.000231897 0 arm_cmn/rnid_rxdat_flits,nodeid=8/ (49.99%)
-
-PMU Events
-**********
-``perf list`` shows the perfmon events for the node types that are detected by
-the arm-cmn driver.
-If a node type is not detected, perf list will not show the events for that
-node type.
-::
-
- # perf list | grep arm_cmn/hnf
- arm_cmn/hnf_brd_snoops_sent/ [Kernel PMU event]
- arm_cmn/hnf_cache_fill/ [Kernel PMU event]
- arm_cmn/hnf_cache_miss/ [Kernel PMU event]
- arm_cmn/hnf_cmp_adq_full/ [Kernel PMU event]
- arm_cmn/hnf_dir_snoops_sent/ [Kernel PMU event]
- arm_cmn/hnf_intv_dirty/ [Kernel PMU event]
- arm_cmn/hnf_ld_st_swp_adq_full/ [Kernel PMU event]
- arm_cmn/hnf_mc_reqs/ [Kernel PMU event]
- arm_cmn/hnf_mc_retries/ [Kernel PMU event]
- [...]
-
-The perfmon events are described in the CMN-600 TRM in the register description
-section for each node type's perf event selection register (at offset 0x2000 of
-each node that has a PMU).
-
-CMN-600 TRM register summary (links to all of the node types' offset registers).
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100180_0201_00_en/bry1447971081070.html
-
-
-Specifying NodeID to events in perf
-***********************************
-To program the CMN-600's PMUs, the NodeIDs of the components need to be
-specified for each event using a nodeid= parameter.
-Example:
-::
-
- $ perf stat -a -I 1000 -e arm_cmn/hnf_mc_reqs,nodeid=0x24/
-
-
-Multiple nodes can be specified for an event using bash brace expansion.
-Note the comma after the -e.
-::
-
- $ perf stat -a -I 1000 \
- -e,arm_cmn/hnf_mc_reqs,nodeid={0x24,0x28,0x44,0x48}/
-
-
-Separate events on the same nodes with brace expansion should be specified
-using separate -e.
-::
-
- $ perf stat -a -I 1000 \
- -e,arm_cmn/hnf_mc_reqs,nodeid={0x24,0x28,0x44,0x48}/ \
- -e,arm_cmn/hnf_mc_retries,nodeid={0x24,0x28,0x44,0x48}/
-
-Note this form requires the trailing '/' at the end of the event name.
-
-
-Driver verification
--------------------
-To verify that the arm-cmn has successfully loaded different ways:
-
-* Check if any arm_cmn entires is available
- ::
-
- $ perf list | grep arm_cmn
- arm_cmn/dn_rxreq_dvmop/ [Kernel PMU event]
- arm_cmn/dn_rxreq_dvmop_vmid_filtered/ [Kernel PMU event]
- arm_cmn/dn_rxreq_dvmsync/ [Kernel PMU event]
- arm_cmn/dn_rxreq_retried/ [Kernel PMU event]
- arm_cmn/dn_rxreq_trk_occupancy_all/ [Kernel PMU event]
- arm_cmn/dn_rxreq_trk_occupancy_dvmop/ [Kernel PMU event]
- [...]
-
-
-* Sysfs entries
- ::
-
- $ ls -x /sys/bus/event_source/devices/arm_cmn/
- cpumask
- dtc_domain_0
- events
- format
- perf_event_mux_interval_ms
- power
- subsystem
- type
- uevent
-
-
-Example
--------
-
-HN-F PMU
-########
-Make sure to issue some memory load operation(s) in parallel, such as
-memtester, while executing the following perf example.
-
-Memory Bandwidth using hnf_mc_reqs
-**********************************
-Measure memory bandwidth using hnf_mc_reqs; assumes bandwidth comes from SLC
-misses.
-::
-
- $ perf stat -e,arm_cmn/hnf_mc_reqs,nodeid={0x24,0x28,0x44,0x48}/ -I 1000 -a
- 2.000394365 121,713,206 arm_cmn/hnf_mc_reqs,nodeid=0x24/
- 2.000394365 121,715,680 arm_cmn/hnf_mc_reqs,nodeid=0x28/
- 2.000394365 121,712,781 arm_cmn/hnf_mc_reqs,nodeid=0x44/
- 2.000394365 121,715,432 arm_cmn/hnf_mc_reqs,nodeid=0x48/
- 3.000644408 121,683,890 arm_cmn/hnf_mc_reqs,nodeid=0x24/
- 3.000644408 121,685,839 arm_cmn/hnf_mc_reqs,nodeid=0x28/
- 3.000644408 121,682,684 arm_cmn/hnf_mc_reqs,nodeid=0x44/
- 3.000644408 121,685,669 arm_cmn/hnf_mc_reqs,nodeid=0x48/
-
-
-Generic bandwith formula:
-::
-
- hnf_mc_reqs/second/hnf node * 64 bytes = X MB/sec
-
-Subsitute with data from perf output:
-::
-
- (121713206 + 121715680 + 121712781 + 121715432) * 64 = 29715 MB/sec
-
-
-
-PCI-E RX/TX bandwidth
-#####################
-The RN-I/RN-D events are defined from the perspective of the bridge to the
-interconnect, so the "rdata" events are outbound writes to the PCI-E device
-and "wdata" events are inbound reads from PCI-E.
-
-Measure RND (PCI-E) bandwidth to/from NVMe SSD when running fio
-***************************************************************
-The NVMe SSD (Optane SSD 900P Series) is on PCI-E Root Complex 0 (PCI0, the
-Gen3 slot, behind the PCI-E switch).
-
-Run ``fio`` to read from NVME SSD using 64KB block size for 1000 seconds in
-one terminal:
-::
-
- $ fio \
- --ioengine=libaio --randrepeat=1 --direct=1 --gtod_reduce=1 \
- --time_based --readwrite=read --bs=64k --iodepth=64k --name=r0 \
- --filename=/dev/nvme0n1p5 --numjobs=1 --runtime=10000
- r0: (g=0): rw=read, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=65536
- fio-3.1
- Starting 1 process
- ^Cbs: 1 (f=1): [R(1)][0.5%][r=2586MiB/s,w=0KiB/s][r=41.4k,w=0 IOPS][eta 16m:35s]
- fio: terminating on signal 2
-
- r0: (groupid=0, jobs=1): err= 0: pid=1443: Thu Dec 19 12:12:10 2019
- read: IOPS=41.3k, BW=2581MiB/s (2706MB/s)(12.3GiB/4894msec) <------------------------------- read bandwidth = 2706 MB/sec
- bw ( MiB/s): min= 2276, max= 2587, per=98.10%, avg=2532.02, stdev=125.43, samples=6
- iops : min=36418, max=41392, avg=40512.33, stdev=2006.90, samples=6
- cpu : usr=3.15%, sys=35.15%, ctx=16686, majf=0, minf=1049353
- IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
- submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
- complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
- issued rwt: total=202101,0,0, short=0,0,0, dropped=0,0,0
- latency : target=0, window=0, percentile=100.00%, depth=65536
-
- Run status group 0 (all jobs):
- READ: bw=2581MiB/s (2706MB/s), 2581MiB/s-2581MiB/s (2706MB/s-2706MB/s), io=12.3GiB (13.2GB), run=4894-4894msec
-
- Disk stats (read/write):
- nvme0n1: ios=202009/2, merge=0/19, ticks=4874362/51, in_queue=3934760, util=98.06%
-
-Measure with ``perf`` in an other terminal.
-Measure rdata/wdata beats. Each beat is 32 bytes.
-::
-
- $ perf stat -e,arm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc/ -I 1000 -a
- 3.000383383 248,145 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 3.000383383 84,728,162 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 4.000522271 248,199 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 4.000522271 84,743,908 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 5.000680779 248,209 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 5.000680779 84,746,976 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 6.000835927 247,899 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 6.000835927 84,417,098 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
-
-Calculate read bandwidth from perf measurement:
-::
-
- 84.74e6 wdata beats * 32 bytes per beat = 2711 MB/sec
-
-Measure RND (PCI-E) bandwidth from Ethernet NIC
-***********************************************
-``netperf`` is executed on the N1SDP to generate network traffic.
-
-``netperf`` executing in on terminal...
-::
-
- $ netperf -D 10 -H remote-server-in-astlab -t TCP_MAERTS -l 0
- Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269135.608
- Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269145.608
- Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269155.608
- Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269165.608
-
-...and ``perf`` in an other at the same time.
-::
-
- $ perf stat -e,arm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc/ -I 1000 -a
- 12.001904404 308,803 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 12.001904404 4,024,328 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 13.002047284 308,994 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 13.002047284 4,024,287 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 14.002233364 309,035 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 14.002233364 4,024,470 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
- 15.002390125 309,162 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
- 15.002390125 4,024,376 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
-
-Calculate bandwidth from perf measurement:
-::
-
- 4.024e6 wdata beats/second * 32 bytes/beat * 8 bits/byte = 1030e6 bits/second
-
-
-*Copyright (c) 2019-2020, Arm Limited. All rights reserved.*