data/bss layout: uses different sections ordered by minimum addressing scale.
no separate .rodata section(s).
.data16: scaling factor 16
.data8, .data4, .data2, .data1: likewise for smaller scaling factors
.bss1, .bss2, .bss4, .bss8, .bss16: bss sections for increasing scaling
factors
The data base pointer register i9 typically points at the place where .bss1
ends and .data1 starts. It might be moved up or down if allocation
would otherwise overflow on one side while there is slack on the other side.
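The base pointer adjustment described above can be modelled roughly as
follows. This is only a sketch under simplifying assumptions: it uses a
single hypothetical reach per side (reach_down, reach_up), whereas the real
reach depends on the scaling factor of each section, and it ignores
alignment. None of the names below come from the mxp port.

```python
def place_base_pointer(bss_size, data_size, reach_down, reach_up):
    """Return the signed shift of the base pointer, in bytes, away from the
    .bss1/.data1 boundary.  Positive means moved up into the data side,
    negative means moved down into the bss side.

    reach_down / reach_up are hypothetical limits on how far below / above
    the base pointer an access can reach (assumption, not from the port).
    """
    over_up = max(0, data_size - reach_up)      # data bytes out of reach
    over_down = max(0, bss_size - reach_down)   # bss bytes out of reach
    if over_up and over_down:
        raise ValueError("both sides overflow; no placement fits")
    if over_up:
        slack = reach_down - bss_size           # unused reach on the bss side
        if over_up > slack:
            raise ValueError("not enough slack on the bss side")
        return over_up
    if over_down:
        slack = reach_up - data_size            # unused reach on the data side
        if over_down > slack:
            raise ValueError("not enough slack on the data side")
        return -over_down
    return 0                                    # boundary placement fits


shift = place_base_pointer(bss_size=100, data_size=300,
                           reach_down=256, reach_up=256)
print(shift)
```

Moving the pointer by exactly the overflow amount is the smallest change that
makes everything reachable in this model; a real implementation would
presumably also have to respect the alignment of the larger scaling factors.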
Tasks to be done:
- Convert this document into a proper texinfo file, incorporate it into
gcc documentation, and test 'make info'
- binutils support for using undefined labels in mxp data/bss sections
as offsets in memory addresses.
- binutils support for mxp code labels. For a start, we are looking to
have a special text section where to put all the mxp code. At link time,
this special text section is considered to be loaded at the start of the
SCM for purposes of resolving SCM absolute relocations. However, the
code actually gets a different load address for the ARC700 core, and gets
a j_s [blink] instruction appended (extra points if you make this a j_s.d
[blink] before the last insn without the potential to break stuff...)
Later we will likely want to move to multiple such special text sections
to handle overlays, and possibly also have different load addresses to
accommodate multiple overlays. If we want to be able to handle SCM PIE,
i.e. code that can be loaded to varying SCM locations, the arc will need
to load a core register with the SCM load address before calling the
SCQ loading code, and the latter will have to use add instructions to
calculate SCM locations on the fly.
No matter if we use such add instructions, or long immediates, instructions
that reference SCM memory locations work out as 64 bit of code on the
arc side, while the other SIMD instructions are injected with a single
32 bit code from the arc side. Thus we have a discrepancy between the
space taken up by the instructions in the object file and the size we
have to consider for purposes of calculating SCM addresses.
Luckily, these differences are constant from the first time the SIMD
assembly is emitted. Thus, the total number of instructions
with SCM references that precede an SCM label gives us the number of
32 bit words to subtract from the total number of preceding 32 bit words
to arrive at the offset from the SCM load address.
To account for preceding SCM references in the same module, we can make
the SCM label appear to be accordingly earlier in the module.
(This will have to be compensated for if we want to do any linktime
relaxation at some later point in time.)
We also need to keep a tally of the total number of SCM references in each
module.
When linking multiple modules together, the total of these tallies for all
preceding modules needs to be added up, and subtracted from the value of
each label.
Like SCM references, (other) long immediates bulk up the code on the arc
side while leaving the SIMD instruction count the same, so they have to
be tallied up together with the SCM references.
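The label arithmetic described above can be checked with a small model.
This is a sketch, not binutils code: a module is represented as a list of
(label, is_wide) pairs, one per SIMD instruction, where is_wide marks an
instruction with an SCM reference or other long immediate (two 32 bit words
on the arc side instead of one). The function names and the data
representation are made up for illustration.

```python
def assemble(module):
    """Model of the per-module step.

    Returns (labels, arc_words, tally): labels maps each label to its
    module-local offset, already moved earlier by the number of preceding
    wide instructions; arc_words is the module size in 32 bit words on the
    arc side; tally is the number of wide instructions in the module.
    """
    labels = {}
    arc_words = 0
    tally = 0
    for label, is_wide in module:
        if label is not None:
            # Make the label appear earlier by the wide insns seen so far.
            labels[label] = arc_words - tally
        arc_words += 2 if is_wide else 1
        tally += 1 if is_wide else 0
    return labels, arc_words, tally


def link(assembled):
    """Model of the link step: subtract the tallies of all preceding
    modules from every label.  Results are offsets from the SCM load
    address, counted in SIMD instructions."""
    resolved = {}
    base = 0         # arc-side start of the current module, in words
    preceding = 0    # wide insns in all preceding modules
    for labels, arc_words, tally in assembled:
        for name, local in labels.items():
            resolved[name] = base + local - preceding
        base += arc_words
        preceding += tally
    return resolved


mod_a = [(None, False), (None, True), ("a_loop", False), (None, True)]
mod_b = [(None, True), ("b_entry", False)]
print(link([assemble(mod_a), assemble(mod_b)]))
# a_loop is the 3rd SIMD insn overall (offset 2), b_entry the 6th (offset 5)
```

In this model each label ends up at the index of its SIMD instruction, which
is what the subtraction is meant to achieve; any link-time relaxation would
have to keep the tallies in step with the code it changes.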
- library functions:
- divsi3: use sh64 code as starting point. Note that there is no
point in loading the table base address before the function call, because
all SCM memory addressing has an offset.
- divv8hi3, divv4si3: use older sh64 code without lookup table as starting
point
- divhi3
- Investigate register class preferencing issues. Naming lane sets with
lane 0 first actually results in the wrong reg_class_subunions. In theory
the ordering should be something like 00, 10, 01, 30, 03, ff, to get the
sets with lane zero preferred for subunions. Preferred classes can be
seen in the *lreg dump file after compiling with -da. Another avenue to
saner subunions is to add proper union lane sets 11, 33.
The paradoxical thing I am seeing here is that the instruction count for
muldi increases when I introduce these measures.
Another - or complementary - approach is to shift the cost balance.
In theory REGISTER_MOVE_COST should have an influence, but in practice
I haven't seen any. What works is adding extra cost to insn alternatives
which allow non-lane0 registers. A problem here - and in general - is that
we want a viable alternate register class. Jacking up the cost for
non-lane0 alternatives can disparage these to the point that we lose the
altclass. We also often have altclasses that don't actually contain any
extra valid registers. In theory increasing MEMORY_MOVE_COST can
compensate, however I see paradoxical outcomes when I try to make this
dependent on !(reload_in_progress || reload_completed). I have a diff
for some of the changes I've tried in
/home/joernr/prefclass-experiments-20080428.
Maybe we need to jack up REGISTER_MOVE_COST, MEMORY_MOVE_COST and RTX_COST
consistently to get a more fine-grained resolution of costs.
- Obtain code samples of code that we think is suitable and relevant for
autovectorization. E.g. some codec.
Dependent tasks:
- Identify the actual section of this code that we think we should be
able to autovectorize.
- Make sure autovectorization takes place.
- Partitioning work. Check with IBM Haifa and other Milepost partners
what they already have.
Inasmuch as not already done:
- Identify individual functions and subgraphs of the callgraph we can move
to the SIMD engine.
- Add code to tree loop analysis to break out loops that we can move to
the SIMD engine.
- Handle data sets that don't fit into SDM. The simplest to implement
approach is probably to do loop tiling at the interface between arc core
and simd engine. OTOH we can get much better parallelism if we hand
over the entire work to the simd engine and let it DMA out the previous
block, and DMA in the next block, while it is performing calculations.
For this we need to represent main memory pointers.
These need not necessarily be exposed as pointers to the mxp-gcc; we could
express the loop tiling with intrinsics.
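The second approach above amounts to tiling with two buffers used in
ping-pong fashion. The following is a sequential sketch of the control flow
only: slices and list appends stand in for DMA transfers, nothing here runs
in parallel, and process_tiled, block_size and kernel are hypothetical
names, not mxp intrinsics.

```python
def process_tiled(data, block_size, kernel):
    """Apply kernel to every element of data, one block at a time.

    block_size models how many elements fit into one SDM buffer.  Two
    buffers alternate: while one is being computed on, the other would be
    filled by DMA on real hardware.
    """
    results = []
    n_blocks = (len(data) + block_size - 1) // block_size
    if n_blocks == 0:
        return results
    buffers = [None, None]
    buffers[0] = data[0:block_size]              # "DMA in" the first block
    for i in range(n_blocks):
        cur = i % 2
        nxt = 1 - cur
        if i + 1 < n_blocks:
            # On hardware this transfer would overlap the computation below.
            start = (i + 1) * block_size
            buffers[nxt] = data[start:start + block_size]
        out = [kernel(x) for x in buffers[cur]]  # compute on current block
        # "DMA out"; on hardware this would overlap the next block's work.
        results.extend(out)
    return results


print(process_tiled(list(range(10)), 4, lambda x: x * 2))
```

With plain loop tiling on the arc side, the transfers and the computation
would instead alternate strictly, which is simpler to implement but gives
up the overlap.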
- Add doloop pattern
- Convert multi-insn define_insn patterns into define_insn_and_split patterns.
- Add scheduler description
- Where missing, add comments to the code according to GNU coding standards.