[ Introduction
| Automodified Code
| Generated Code
| Red Alert
| Cacheable Memory
| Appendix A ]
speedwar.zip (19Kb): Instruction timings, opcodes and FastBox routine.
Alain BROBECKER Dracula / Positivity (STe)
rte de Dardagny Baah / Arm'sTeack (Archie)
01630 CHALLEX Baah (PC)
FRANCE
--------------------------------------------------------------------- July 96
- SPEED WAR -
===========
(Initiation to automodified and generated code)
FOREWORD
This text is aimed to assembly programmer, I doubt people programming
in high level languages will appreciate it much. All algorithms, ideas
and assembly code in here were made by me. (else I mention the author)
If you use it in one of your programs, please send me a free copy of it
and credit me. If you appreciated this text, I would be glad to know it.
INTRODUCTION
Once again we will dive into the bowels of optimisation, but for once
our first objective will be speed of code instead of size. Luckily, both
those aspects of optimisation are not at all incompatible (generally the
less instructions you have the fastest it goes, and this is especially
true on cached processors) and so we' ll achieve a double performance.
When programming, not only the language used is important, but so are
the algorithms and 'programming skills' used. Just to prove this, let us
have a small (and usefull) example...
mov r0,#nb mov r0,#nb
.loop adr r14,loop
[...] .loop
bl routine [...]
subS r0,r0,#1 subS r0,r0,#1
bNE loop bGE routine
These programs are giving exactly the same results if r14 is not changed
in the [...] part and in the routine. (if you don' t trust me, I strongly
recommend you get a closer look!) The difference lies in their respective
speeds: The left program uses three branchs (one 'bl', one 'mov pc,r14'
in routine and one 'bNE') where the right program uses only two. ('bGE'
and 'mov pc,r14' in routine) (Some people will even say we can just
include the routine in the core of the loop, then having only one 'bNE'.
True, but not always the best solution: suppose we have a cache and that
routine is used in more than one loop...)
Amongst the programming skills, the automodified and generated code are
giving excellent results, and their name is very explicit: quite simply
we' ll write routines which modifies themselves or create code from the
scratch to suit best a given problem. For our purposes we' ll need to know
the precise binary coding (=opcodes) of instructions. They can be found in
the ARM family data manual by VLSI technology, or alternatively in a text
called "instr_set", normally provided in this CR issue.
AUTOMODIFIED CODE
I won' t give you a real example in here, I just want to introduce the
basis of self-modified code. Sure you will find pieces of code where the
use of such tricks will be interesting. Suppose we have the following
loop in our proggy...
mov r0,#nb
.loop
add r1,r1,r2,asr r4
add r2,r2,r3
subS r0,r0,#1
bNE loop
It is clear that r3 and r4 are not modified in this small loop. If both
where constants (I assume r3 is lower than 256) during the whole program
we would have used directly their values, and this would have saved us
time (an 'asr #imm5' is faster than 'asr r4') and registers. (r3 & r4) But
in case r3 and r4 where coefs which were calculated before, (and so the
loop may be called with different values in r3 & r4) are we obligated to
use those registers to store values?
No! (else I won' t have asked ;) Just before we enter the loop, we' ll
modify the instructions so that they contain the constant values we' ve
just calculated... Here' s the code performing this:
mov r0,#nb
mov r5,#&e0811042 ; Opcode of 'add r1,r1,r2,asr #0'.
add r5,r5,r4,lsl #7 ; Put r4 as the shift amount.
str r5,loop ; Save instruction.
strB r3,loop+4 ; Save r3 as imm8 in 'add r2,r2,#imm8'
mov r0,r0 ; Beware the pipeline.
.loop
dcd 0 ; Will contain 'add r1,r1,r2,asr #imm5'.
add r2,r2,#0 ; Will contain 'add r2,r2,#imm8'.
subS r0,r0,#1
bNE loop
Before everything, I recommend you trace this small proggy (with ArcMon
or QDbug) in order to see the changes during execution. (Remember to assign
values to r3,r4 and nb and to put a 'mov pc,r14') Maybe things will be
clearer once this is done.
You' ve noticed that I' ve put a strange instruction, the one with the
fascinating "beware the pipeline" comment. You shall know that ARMs have a
small memory (2 longwords) called pipeline containing the two instructions
consecutive to the current executed one. The ARM already begin to decode
those instructions so that it can handle them faster. When there' s a
branch (or a swi, or a 'mov pc,??') the next instructions won' t be the
ones in the pipeline, and so the ARM must flush it. (This explains the
slowness of branches) For our concern, you must know that when you modify
an instruction which is in pipeline, it will be modified in memory only,
and the pipelined one (and non modified one) is the one which will be
executed. (Remember that the ARM has already begun to cope with it) So, I
had to put a nop (NoOPeration) in order to have the instructions out of
pipeline when I modify them. This is an important remark and we' ll be
back on such problems later, but you can try to trace the prog when the
'mov r0,r0' has been removed.
You must also be aware that self-modifying code isn' t the ultimate
solution and mustn' t be put everywhere. For example, if nb=2, sure it
wouldn' t be worth its price. But if the loop has to be executed a great
number of time or if you trully lack of registers inside a loop, I bet
automodified code is the right choice.
GENERATED CODE
At first, many people will say this isn' t much different from self-
modified code, and I don' t agree... With automodificated code you know
exactly what is the sequence of instruction, and what will be changed.
We can say there are no major changes in the structures of the program.
The objectives of generated code are different. We won' t only change
instructions, but we' ll create from the scratch a routine which will give
optimum speed (at least try to) for a given problem.
Before making a code-generator, you must get a headache in devising the
fastest way of dealing with your problem. Then, making the program which
will create such a routine is the tedious part of the work, because you
aren' t allowed mistakes. (Computer crashes are nicer with generated code
than with anything else ;)
This time, you' ll have an example and one I' m really proud of. I give
you a routine which generates the optimal code for a rectangle filling.
What I mean when I say optimal routine is one which contains strB for the
edges filling, and as much stmia as possible. Also, when possible I put
the offset to change line in a strB or str instead of making adds. The
routine is widely commented, so I hope you will be able to understand it
in its entirety. (Good HeadAche ;)
When it comes to results, my routine is between 3 to 4 times faster than
the BASIC' s Rectangle Fill on my Arm2!!! In appendix A, I give you a
BASIC version of it, but you can find the ExtASM version with lotsa stuff
more in the source of the small intro I coded for Coder' s Revenge. (Also
in this issue?)
INSPIRATION
Some ideas, just to prove you the two technics seen above are not only
curiosities, but can be used in many graphical routines, and indeed are
often used to speed things up...
* Sprite Code Generator: When you want to draw a sprite, you generally
check for transparency of a pixel before drawing it. For a given sprite,
the result of the tests will always be the same, and a convenient way of
speeding things is to generate a routine which will draw the visible
pixels only, and so we avoid the tests... This trick increases speed, but
needs a lot of memory because we have one routine per sprite, (4 routs if
you optimise with str/stmia accesses) and problems also arises when you
want clipped sprites.
So far ArcheType/Arm'sTeack has made a sprite code generator. GUS, who
recently joined Arm'sTeack has also made one, called GrimAce. Last (and
least?) I' ve made my owns on ST, PC and Archie. So far I' ve seen two
Archie demos using generated code for sprites, and thousands of them on
the Atari ST.
* If you' ve seen my CakeHead2 demo, the zoomer at the beginning of it
was made in a way similar to the one used in FastBox. Each frame, I
generate the code (not optimal at all, isn' t it GUS ;) which zooms
a line, and then I use this routine for each line. GUS (again) is also
working on such a routine (I' ve not seen it yet) and it promise to be
a very good one (suitable for transparent zoomed sprites) with some
clever ideas. (he keeps in mind the latests zoom routine in case he' ll
re-use them) I' ve also seen a neat zoom code generator routines on the
Atari. (Calimer-O demo by OXyGene)
* Always in CakeHead2, the RSS part (the one with "fuzzy" morphing objects)
is also using generated code. At first I was drawing randomly a filament.
ArmOric then proposed to generate code for a certain amount of filaments,
and randomly choose between this codes. Result is ok and incredibly
faster than the previous version. (Except maybe on cached processors)
* Still on the Atari ST, the fastest polygon routines (I' ve seen code of
Ziggy Stardust+MCoder+Algernon and code of Nullos/DNT-Crew) where using
automodified code for the edges filling. (A trick which was possible due
to the bitplanar screen memory)
* Surivor/MafiAmiga (the guy who learned me 68000) told about an Amiga demo
where the blitter code was written by the copper, and the copper code was
(as usual) written by the 68000. Double generated code, it sounds WEIRD!
(??? demo by MetalWar/Alcatraz)
* My friend Garry Copper/MentASM (Amiga) often uses generated code to speed
up his demo effects, and sure he isn' t the only one.
* Guess what... The only purpose of assemblers and compilers is to generate
code, the difference being it doesn' t need to be made in realtime. And
I bet the Basic interpreter makes an extensive use of generated code.
* I heard that the Acorn Replay utility uses generated code to improve its
efficiency on a given output.
* As you' ve seen, I' ve made a fast box filling routine (and some more),
I must also say the first time I heard about generated code (in an old
ST Magazine) it was the same kind of example which was given.
RED ALERT
I bet not everybody will be pleased with my (I hope contagious) emphasis
about generated code, simply because it' s something which must be used
carefully: We' ve already seen the problems due to ARM's pipeline. Similar
problems generally arises with cache memory. For the ones who don' t know
what cache memory is, here comes a small explanation. (I assume you' ve
read the part dedicated to pipeline)
Ram is much slower than processors, so designers thought about putting
a piece of extra-fast (and extra-expensive) ram in the core of CPUs, which
is known as cache memory. Also, they wanted the managment of this ram to
be "transparent" to coder by default. (by my point of view a stupidity)
As for pipeline the cache contains instructions which are already in the
main (and slow) memory. When executing an uncached instruction, the memory
manager will search in main ram the instruction you want and put it in the
cache. (also removing one, generally the "oldest")
This is very convenient when you have a small loop... The manager loads
all loop instructions in the cache and when it finds the final 'bNE' it
will discover that the loop is already in cache memory and can execute it
at maximum speed. (No more access to main ram to read instructions)
The problem is that some cache memory aren' t write-through. (Like the
pipeline) This means that when you write (modify code for example) to an
adress which is duplicated in cache memory, the data in main ram will of
course be changed but not the one in cache memory. (So we have not written
through the cache) The result being the main ram and the cache containing
differents values for one and same adress. The result is imprevisible and
generally desastrous. (Major cause of incompatibility in demos and games
between ST/Falcon, Amiga/Amiga1200)
Fortunately, ARM processors' cache is write-through! So, generated code
and automodified code doesn' t matter on an Archimedes as long as you take
care of pipelining. What a good new, it really demonstrates how the ARM
processor is well designed, and certainly makes you wonder why I called
this chapter red alert!
Quite simply ARM made a joint venture with Digital (sold his soul to the
devil?) in order to create a high frequency ARM, called the StrongARM.
Amongst other features, this processor has a deeper pipeline and a non
write-through instruction cache. Normally, the deeper pipeline would make
self-modified code crash, but I heard it' s not the case due to some fancy
pipeline handling. The problem of the instruction cache memory remains.
So what? Shall we leave generated code down in order to make proggys
run on StrongArm? Sure not, and the other computers (Amiga, Falcon) have
proven it is possible to make generated code with a cache. First solution
is to turn the cache off. =) I bet you won' t be very excited by this one,
since it means loosing all the speed inherent to cache memory. The second
one, and this time it' s no joke, is to control the cache by ourselves. If
you remember, I said processors designers wanted the cache management to
be transparent to coders, but I bet there are some instructions which will
allow us to control the cache. (At least it is the case on 680x0) And if
this is feasable (I think so, but I can' t promise since I don' t have the
docs... I own an Arm2 after all), you' ll be able to get the maximum power
of your ARM because you are the one who knows what part of code needs to
be cached and which doesn' t. (If you don' t agree, does this mean a chip
(MemC?) is clever than you?)
Ok, the article must come to an end now... If you are a skilled coder,
and know everything about cache memory, what about making an article
dealing with it so that everybody will be able to code demos working on
every Archie computers, even StrongARMed ones, and at least show the PC
users what real power is?
Having knowledge is good,
What about sharing it?
- THE END -
P.S: ABOUT UNCACHEABLE MEMORY (Oct-Dec 96)
I spoke about processors with cache memory in the article dealing with
generated code, but only considered the danger for StrongARMed RiscPCs.
Time now to tell more about this fast piece of memory, and I' ll give
you a little trick I' m quite proud of.
When you perform a ldr[B] in a 'cacheable' part of memory, the processor
will in fact load the whole line (=4 longs) containing this data in the
cache memory. Then, if you make further access to this data line, he' ll
remind that he already have it in cache memory, and so won' t need further
accesses to main ram. This is very valuable when you access consecutive
datas, or when you have frequent accesses to a same data line.
Sometimes (rotozooms, texture mapping...) you won' t need consecutive
datas, but distant ones instead. And if your matrix/texture or else is in
a 'cacheable' part of memory, even if you need a single byte, the cache
will be loaded with the full data line. 4 longs read in main ram where
only one is needed, what a wasteage. Solution lies in the abilty to have
memory parts marked as 'uncacheable', thus meaning that all read accesses
will load the needed data in register as expected, and the cache won' t be
fed up with the data line.
So now, how can we declare a memory part as uncacheable? According to
the documentations, this is very (VERY) tedious since you have to remap
your datas to a logical memory section (?00000-?fffff) which is marked as
uncacheable. So you need to control both cache and memory managment units,
(MMU) and fooling with this might proove dangerous. (RiscOS is the one
normally meant to be in control) Anyway, handling it seemed so boring I
think I wouldn' t have put my hands on it.
And then light came to my mind. I foretold the video memory was declared
by the system as an uncacheable area... And after some tests on ArmOric' s
RiscPC we discovered my foreseeing was true. So, no more needs to bang your
head to the wall with all this damn! (yeah) memory management: you simply
need to allocate more screen memory for your prog, and put all your
critical datas in it.
;****************************************************************************
;***** *****
;***** APPENDIX A *****
;***** *****
;****************************************************************************
; Well, for once it is BASIC code, just because I wanted to ridiculise it.
; And well, not everybody do have a copy of ExtASM. But the ExtASM version
; of it can be found in the source of the 'IntroCR'.
REM==== ASM CODE ============================================================
DIMc 1024*4
FORo=0TO2STEP2
P%=c
[opt o
; ---------------------------------------------------------------------------
; --- Routine drawing a 8 bpp box ---
; --- (C) Alain BROBECKER May 96 ---
; ---------------------------------------------------------------------------
; * This routine draws the box between (x1;y1) and (x2;y2) on the mode13
; screen with the given filling pattern.
; * r13 is saved just after the generated code, so we can use it for the
; ldmia-stmia copy. But we have to generate the instructions which will
; restore it at the end of routine.
; * The last filling instruction (str, strB or add, to modify the adress)
; is generated with the stmia used for the endcode generation.
; * Most times (>75%) we won' t need an add to modify offsets, so I choosed
; to branch in such cases instead of cases when we have a str(B).
; * There are many other tricks (for the generation of 'bGE'..) but I won' t
; explain them since code is widely commented.
;
; Parameters are...
; r0 = screen adress.
; r1 = filling pattern.
; r2 = x1. 1------+
; r3 = y1. | |
; r4 = x2. | |
; r5 = y2. +------2
.fastbox256
cmp r2,#320 ; At first check if the box is
cmpLT r3,#256 ; completly out of screen, and in
movGE pc,r14 ; such case we quit.
cmp r4,#0
cmpGE r5,#0
movLT pc,r14
stmfd r13!,{r0-r12,r14} ; Be clean or die.
cmp r2,#0 ; Perform the clipping.
movLT r2,#0
cmp r3,#0
movLT r3,#0
cmp r4,#320
movGE r4,#255
addGE r4,r4,#(319-255)
cmp r5,#256
movGE r5,#255
subS r14,r5,r3 ; r14=dy=y2-y1.
subGES r5,r4,r2 ; r5=dx=x2-x1.
ldmMIfd r13!,{r0-r12,pc} ; Quit if dy<0 or dx<0.
add r3,r3,r3,lsl #2 ; r3=y1*5.
add r0,r0,r3,lsl #6 ; r0=screen+y1*320.
add r0,r0,r2 ; r0=screen+y1*320+x1.
mov r3,r2,lsr #2 ; r3=x1/4.
rsbS r3,r3,r4,lsr #2 ; r3=x2/4-x1/4=nb_longs.
adr r7,_small_adr ; r7 points on adresses for small boxes.
ldrEQ pc,[r7,r5,lsl #2] ; nb_longs=0, then execute small box rout.
rsb r5,r5,#255 ; r5=319-dx=nb of bytes to pass each line.
add r5,r5,#(319-255)
adr r6,_code+4 ; Generate code here.
ldmdb r7!,{r8-r11} ; Load some opcodes.
; Here we begin to care about first longword filling.
andS r2,r2,#%11 ; r2=x1 mod(3).
bEQ _first_long_full ; If x1 mod(3)=0, first long is full.
subNE r3,r3,#1 ; Else first long mustn' t be drawn.
tst r2,#%01 ; Down bit of x1 set?
strNE r8,[r6],#4 ; Then we have an odd nb of strB.
tst r2,r2,lsl #1 ; bit1 and bit0 of x1 cleared?
strEQ r8,[r6],#4 ; Then x1 mod(3)=1 or 2, so we must
strEQ r8,[r6],#4 ; generate two strB more.
._first_long_full
and r4,r4,#%11 ; r4=x2 mod(3).
and r2,r4,r4,lsr #1 ; r2=bit1 and bit0 of x2 mod(3).
addS r3,r3,r2 ; If x2 mod(3)=%11, last long is full.
bEQ _last_longword ; If nb_longs=0 go to last long.
._one_stmia_max
subS r3,r3,#13 ; More than 13 longs left?
strGE r9,[r6],#4 ; Yes, then save one stmia max.
bGT _one_stmia_max ; r3>0? Then test again.
ldrMI r9,[r7,r3,lsl #2] ; If r3<0 then load opcode of last long
strMI r9,[r6],#4 ; fill instruction and save it.
._last_longword
teq r4,#%11 ; x2 mod(3)=%11?
bEQ _last_long_full ; Then last long is full.
tst r4,#%01 ; Down bit clear?
strEQ r8,[r6],#4 ; Then we have an odd nb of strB.
teq r4,r4,lsl #1 ; bit1 eor bit0<>0?
strNE r8,[r6],#4 ; Then x2 mod(3)=1 or 2, then there
strNE r8,[r6],#4 ; are two strB more.
._last_long_full
ldr r8,[r6,#-4]! ; Load last saved instruction.
tst r8,#1<<26 ; Is it a str or strB?
bEQ _generate_add ; No, then we' ll need an add.
add r8,r8,r5 ; Yes, then add to 319-dx to offset.
._generate_endcode
adr r9,_code-3*4 ; Beware the pipeline and last instruction.
sub r9,r9,r6 ; r9=offset for the bGE.
mov r9,r9,asr #2 ; r9=offset/4. (higher byte=&ff)
eor r9,r9,#&55<<24 ; r9=&AAxxxxxx='bGE offset'.
stmia r6!,{r8-r11,r13} ; Save instructions and stack.
mov r2,r1 ; Put pattern in other longwords.
mov r3,r1
mov r4,r1
mov r5,r1
mov r6,r1
mov r7,r1
mov r8,r1
mov r9,r1
mov r10,r1
mov r11,r1
mov r12,r1
mov r13,r1
._code
subS r14,r14,#1 ; One line will be drawn.
dcd 0:dcd 0:dcd 0 ; Space for code and stack.
dcd 0:dcd 0:dcd 0:dcd 0:dcd 0:dcd 0
dcd 0:dcd 0:dcd 0
dcd 0:dcd 0:dcd 0:dcd 0
._generate_add
cmp r5,#0 ; Offset is null?
bEQ _generate_endcode ; Then go on...
add r6,r6,#4 ; Don' t modify loaded instruction.
mov r8,#&e2<<24 ; r8=opcode of 'add r0,r0,#0'.
add r8,r8,#&8<<20
cmp r5,#255 ; Offset bigger than 255?
addGE r2,r8,#255 ; Then generate an 'add r0,r0,#255'
strGE r2,[r6],#4
subGE r5,r5,#255 ; and substract 255 to offset.
add r8,r8,r5 ; r8='add r0,r0,#offset'.
b _generate_endcode
str r1,[r0],#4 ; Opcodes for lasts longs filling.
stmia r0!,{r1-r2}
stmia r0!,{r1-r3}
stmia r0!,{r1-r4}
stmia r0!,{r1-r5}
stmia r0!,{r1-r6}
stmia r0!,{r1-r7}
stmia r0!,{r1-r8}
stmia r0!,{r1-r9}
stmia r0!,{r1-r10}
stmia r0!,{r1-r11}
stmia r0!,{r1-r12}
._opcodes
strB r1,[r0],#1 ; Byte filling instruction.
stmia r0!,{r1-r13} ; Maximum filling instruction.
ldr r13,[pc,#0] ; Load stack which is 8 bytes after.
ldmfd r13!,{r0-r12,pc} ; And quit.
._small_adr
dcd _small1 ; Adresses for the routines corresponding
dcd _small2 ; to x1-x2 being in same longword.
dcd _small3
dcd _small4
; Here are the routine for small boxes.
._small1
strB r1,[r0],#320
subS r14,r14,#1
bGE _small1
ldmfd r13!,{r0-r12,pc}
._small2
strB r1,[r0],#1
strB r1,[r0],#319
subS r14,r14,#1
bGE _small2
ldmfd r13!,{r0-r12,pc}
._small3
strB r1,[r0],#1
strB r1,[r0],#1
strB r1,[r0],#318
subS r14,r14,#1
bGE _small3
ldmfd r13!,{r0-r12,pc}
._small4
str r1,[r0],#320
subS r14,r14,#1
bGE _small4
ldmfd r13!,{r0-r12,pc}
]
NEXTo
REM==== BASIC CODE ==========================================================
MODE13:OFF:MOUSEON
DIMv 8:!v=148:v!4=-1:SYS"OS_ReadVduVariables",v,v:v=!v
REPEAT
WAIT:WAIT:CLS
PRINT"Press Mouse to begin test"
MOUSEx,y,z
RECTANGLE FILL 0,0,x,y
UNTILz<>0
REM Testing Rectangle Fill.
TIME=0
FORi=0TO2047
RECTANGLE FILL 0,0,x,y
NEXT
t1=TIME
REM TEsting my Routine.
TIME=0
A%=v
B%=&11111111
C%=0
D%=(1023-y)>>2
E%=x>>2
F%=255
FORi=0TO2047
CALLfastbox256
NEXT
t2=TIME
REM The results.
MODE0
PRINT"To draw 2048 such rectangles:"
PRINT" Basic RECTANGLE FILL takes ";t1/100;" sec."
PRINT" My routine takes ";t2/100;" sec."
PRINT" My rout is " t1/t2 " times faster..."
END