[ Introduction | Automodified Code | Generated Code | Red Alert | Cacheable Memory | Appendix A ]
speedwar.zip (19Kb): Instruction timings, opcodes and FastBox routine.

Alain BROBECKER                                    Dracula / Positivity (STe)
rte de Dardagny                                    Baah / Arm'sTeack (Archie)
 01630 CHALLEX                                                      Baah (PC)
--------------------------------------------------------------------- July 96

                                - SPEED WAR -
               (Initiation to automodified and generated code)


    This text is aimed at assembly programmers; I doubt people programming
  in high level languages will appreciate it much. All algorithms, ideas
  and assembly code in here were made by me (otherwise I mention the
  author). If you use it in one of your programs, please send me a free
  copy of it and credit me. If you appreciated this text, I would be glad
  to know it.


    Once again we will dive into the bowels of optimisation, but for once
  our first objective will be speed of code instead of size. Luckily, those
  two aspects of optimisation are not at all incompatible (generally the
  fewer instructions you have, the faster it goes, and this is especially
  true on cached processors) and so we' ll achieve a double performance.
    When programming, not only the language used is important, but so are
  the algorithms and 'programming skills' used. Just to prove this, let us
  have a small (and useful) example...
            mov   r0,#nb                        mov   r0,#nb
         .loop                                  adr   r14,loop
            [...]                             .loop
            bl    routine                       [...]
            subS  r0,r0,#1                      subS  r0,r0,#1
            bNE   loop                          bGE   routine

    These programs give the same results if r14 is not changed in the
  [...] part or in the routine. (if you don' t trust me, I strongly
  recommend you take a closer look! Note that the [...] part runs one
  extra time in the right program, after the final subS fails, so it must
  be harmless to repeat) The difference lies in their respective speeds:
  the left program uses three branches per iteration (one 'bl', one
  'mov pc,r14' in routine and one 'bNE') where the right program uses
  only two. ('bGE' and 'mov pc,r14' in routine) (Some people will even
  say we can just include the routine in the core of the loop, then
  having only one 'bNE'. True, but not always the best solution: suppose
  we have a cache and that routine is used in more than one loop...)
    Amongst the programming skills, automodified and generated code give
  excellent results, and their names are very explicit: quite simply
  we' ll write routines which modify themselves or create code from
  scratch to best suit a given problem. For our purposes we' ll need to
  know the precise binary encoding (=opcodes) of the instructions. They
  can be found in the ARM family data manual by VLSI Technology, or
  alternatively in a text called "instr_set", normally provided in this
  CR issue.
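
    (As a sanity check, here is how such an opcode is assembled from its
  fields. Python is used purely as a calculator here; the field layout is
  the standard ARM data-processing format, and the function name is mine.)

```python
# Illustration only: build the 32-bit opcode of 'add rd,rn,rm,asr #shift'
# from its fields, following the ARM data-processing instruction format.
def add_rd_rn_rm_asr(rd, rn, rm, shift):
    cond   = 0xE << 28          # condition 1110 = always
    opcode = 0x4 << 21          # 0100 = ADD (data processing, reg operand2)
    asr    = 0x2 << 5           # shift type 10 = ASR, by immediate
    return cond | opcode | (rn << 16) | (rd << 12) | (shift << 7) | asr | rm

print(hex(add_rd_rn_rm_asr(1, 1, 2, 0)))   # -> 0xe0811042
```

  This is exactly the opcode used in the automodified loop below: the
  shift amount sits in bits 11-7, which is why an 'lsl #7' is enough to
  patch it.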



    I won' t give you a real example in here, I just want to introduce
  the basics of self-modified code. Surely you will find pieces of code
  where the use of such tricks will be interesting. Suppose we have the
  following loop in our proggy...
            mov   r0,#nb
         .loop
            add   r1,r1,r2,asr r4
            add   r2,r2,r3
            subS  r0,r0,#1
            bNE   loop
    It is clear that r3 and r4 are not modified in this small loop. If
  both were constants (I assume r3 is lower than 256) during the whole
  program, we would have used their values directly, and this would have
  saved us time (an 'asr #imm5' is faster than 'asr r4') and registers.
  (r3 & r4) But in case r3 and r4 are coefficients which were calculated
  before, (and so the loop may be called with different values in r3 &
  r4) are we obliged to use those registers to store the values?
    No! (else I wouldn' t have asked ;) Just before we enter the loop,
  we' ll modify the instructions so that they contain the constant values
  we' ve just calculated... Here' s the code performing this:
            mov   r0,#nb
            ldr   r5,opcode         ; Opcode of 'add r1,r1,r2,asr #0'.
            add   r5,r5,r4,lsl #7   ; Put r4 as the shift amount. (bits 11-7)
            str   r5,loop           ; Save instruction.
            strB  r3,loop+4         ; Save r3 as imm8 in 'add r2,r2,#imm8'
            mov   r0,r0             ; Beware the pipeline.
         .loop
            dcd   0                 ; Will contain 'add r1,r1,r2,asr #imm5'.
            add   r2,r2,#0          ; Will contain 'add r2,r2,#imm8'.
            subS  r0,r0,#1
            bNE   loop
         .opcode
            dcd   &e0811042         ; 'add r1,r1,r2,asr #0'. (&e0811042 isn' t
                                    ;   a valid imm8r, hence the ldr above)
    Before everything, I recommend you trace this small proggy (with
  ArcMon or QDbug) in order to see the changes during execution.
  (Remember to assign values to r3, r4 and nb, and to put a 'mov pc,r14'
  after the 'bNE') Maybe things will be clearer once this is done.
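
    (If you want to convince yourself that the two patches really produce
  the intended opcodes, the arithmetic can be checked on paper, or in
  Python as below. The example values for r3 and r4 are mine.)

```python
# Check the two patches performed above (illustration, not ARM code):
base = 0xE0811042                 # 'add r1,r1,r2,asr #0'
r4 = 9                            # example shift amount
patched = base + (r4 << 7)        # shift amount lives in bits 11..7
assert (patched >> 7) & 0x1F == 9 # the imm5 field now holds r4

add_imm = 0xE2822000              # 'add r2,r2,#0'
r3 = 200                          # example constant, must fit in 8 bits
patched2 = (add_imm & ~0xFF) | r3 # the strB overwrites the low byte (imm8)
assert patched2 & 0xFF == 200     # the imm8 field now holds r3
```

  The strB trick works because the ARM is little-endian: the imm8 field
  is precisely the lowest byte of the instruction word.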
    You' ve noticed that I' ve put a strange instruction, the one with the
  fascinating "beware the pipeline" comment. You should know that ARMs
  have a small memory (2 longwords) called the pipeline, containing the
  two instructions following the one currently executed. The ARM already
  begins to decode those instructions so that it can handle them faster.
  When there' s a branch (or a swi, or a 'mov pc,??') the next
  instructions won' t be the ones in the pipeline, and so the ARM must
  flush it. (this explains the slowness of branches) For our concern, you
  must know that when you modify an instruction which is in the pipeline,
  it will be modified in memory only, and the pipelined (and unmodified)
  one is the one which will be executed. (remember that the ARM has
  already begun to cope with it) So I had to put a nop (NoOPeration) in
  order to have the instructions out of the pipeline when I modify them.
  This is an important remark and we' ll come back to such problems
  later, but you can try to trace the prog with the 'mov r0,r0' removed.
    You must also be aware that self-modifying code isn' t the ultimate
  solution and mustn' t be put everywhere. For example, if nb=2, it
  surely wouldn' t be worth its price. But if the loop has to be executed
  a great number of times, or if you truly lack registers inside a loop,
  I bet automodified code is the right choice.
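
    (To get a feel for when the patch pays off, here is a back-of-the-
  envelope model in Python. The cycle counts are assumptions of mine, not
  measurements on a real ARM2; only the shape of the reasoning matters.)

```python
# Rough break-even estimate for the automodified loop (assumed figures):
patch_overhead = 10      # cycles to build and store the two opcodes (assumed)
saving_per_pass = 1      # 'asr #imm5' vs 'asr r4' saves about 1 cycle (assumed)

def worth_it(nb):
    # Patching wins once the per-iteration savings outweigh the setup cost.
    return nb * saving_per_pass > patch_overhead

print(worth_it(2), worth_it(1000))   # short loop: no; long loop: yes
```

  And this model doesn' t even count the two freed registers, which is
  often the real reason to do it.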


    At first, many people will say this isn' t much different from self-
  modified code, but I don' t agree... With automodified code you know
  exactly what the sequence of instructions is, and what will be changed.
  We can say there are no major changes in the structure of the program.
    The objectives of generated code are different. We won' t only change
  instructions, but we' ll create from scratch a routine which will give
  optimum speed (or at least try to) for a given problem.
    Before making a code generator, you must get a headache devising the
  fastest way of dealing with your problem. Then, making the program
  which will create such a routine is the tedious part of the work,
  because you aren' t allowed mistakes. (computer crashes are nicer with
  generated code than with anything else ;)
    This time you' ll have an example, and one I' m really proud of. I
  give you a routine which generates the optimal code for a rectangle
  filling. What I mean by an optimal routine is one which contains strB
  for the edges filling, and as many stmia as possible. Also, when
  possible, I put the offset to change line in a strB or str instead of
  making adds. The routine is widely commented, so I hope you will be
  able to understand it in its entirety. (Good HeadAche ;)
    When it comes to results, my routine is between 3 and 4 times faster
  than BASIC' s Rectangle Fill on my Arm2!!! In appendix A, I give you a
  BASIC version of it, but you can find the ExtASM version with lots more
  stuff in the source of the small intro I coded for Coder' s Revenge.
  (Also in this issue?)
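
    (The choice FastBox makes for each line, a few strB up to alignment,
  stmia for the middle, strB for the tail, boils down to the arithmetic
  below. This is a Python sketch of the planning step only, and the
  function name is mine.)

```python
def plan_line(x1, x2):
    """Split pixels x1..x2 (inclusive) of an 8bpp line into leading
    strB's, fully covered longwords (stmia), and trailing strB's.
    Assumes the span covers at least one aligned longword."""
    first_long = (x1 + 3) // 4        # first fully covered longword
    last_long  = (x2 + 1) // 4        # one past the last fully covered one
    lead  = first_long * 4 - x1       # bytes before the aligned part
    trail = x2 + 1 - last_long * 4    # bytes after the aligned part
    return lead, last_long - first_long, trail

lead, longs, trail = plan_line(3, 18)
assert lead + 4 * longs + trail == 18 - 3 + 1   # every pixel accounted for
```

  The real routine does this with ands and shifts, of course, and then
  emits one instruction per item of the plan.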


    Some ideas, just to prove the two techniques seen above are not only
  curiosities, but can be used in many graphical routines, and indeed are
  often used to speed things up...
  * Sprite Code Generator: When you want to draw a sprite, you generally
    check a pixel for transparency before drawing it. For a given sprite,
    the result of the tests will always be the same, and a convenient way
    of speeding things up is to generate a routine which will draw the
    visible pixels only, and so we avoid the tests... This trick
    increases speed, but needs a lot of memory because we have one
    routine per sprite, (4 routines if you optimise with str/stmia
    accesses) and problems also arise when you want clipped sprites.
      So far ArcheType/Arm'sTeack has made a sprite code generator. GUS,
    who recently joined Arm'sTeack, has also made one, called GrimAce.
    Last (and least?) I' ve made my own on ST, PC and Archie. So far
    I' ve seen two Archie demos using generated code for sprites, and
    thousands of them on the Atari ST.
  * If you' ve seen my CakeHead2 demo, the zoomer at the beginning of it
    was made in a way similar to the one used in FastBox. Each frame, I
    generate the code (not optimal at all, isn' t it GUS ;) which zooms
    a line, and then I use this routine for each line. GUS (again) is
    also working on such a routine (I' ve not seen it yet) and it
    promises to be a very good one (suitable for transparent zoomed
    sprites) with some clever ideas. (he keeps the latest zoom routines
    in mind in case he can re-use them) I' ve also seen a neat zoom code
    generator routine on the Atari. (Calimer-O demo by OXyGene)
  * Still in CakeHead2, the RSS part (the one with "fuzzy" morphing
    objects) also uses generated code. At first I was drawing filaments
    randomly. ArmOric then proposed to generate code for a certain number
    of filaments, and to choose randomly between these codes. The result
    is ok and incredibly faster than the previous version. (except maybe
    on cached processors)
  * Back on the Atari ST, the fastest polygon routines (I' ve seen code
    by Ziggy Stardust+MCoder+Algernon and code by Nullos/DNT-Crew) were
    using automodified code for the edges filling. (a trick which was
    possible due to the bitplanar screen memory)
  * Surivor/MafiAmiga (the guy who taught me 68000) told me about an
    Amiga demo where the blitter code was written by the copper, and the
    copper code was (as usual) written by the 68000. Doubly generated
    code, it sounds WEIRD! (??? demo by MetalWar/Alcatraz)
  * My friend Garry Copper/MentASM (Amiga) often uses generated code to
    speed up his demo effects, and surely he isn' t the only one.
  * Guess what... The only purpose of assemblers and compilers is to
    generate code, the difference being that it doesn' t need to be done
    in realtime. And I bet the Basic interpreter makes extensive use of
    generated code.
  * I heard that the Acorn Replay utility uses generated code to improve its
    efficiency on a given output.
  * As you' ve seen, I' ve made a fast box filling routine (and some
    more). I must also say that the first time I heard about generated
    code (in an old ST Magazine) it was the same kind of example which
    was given.

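    (In miniature, the sprite code generator idea looks like this. A
  Python closure stands in for real generated machine code, and the
  sprite data is invented; the point is that the transparency tests are
  done once, at generation time.)

```python
def make_sprite_routine(sprite, transparent=0):
    # Scan the sprite once: remember only (offset, colour) of visible pixels.
    visible = [(i, c) for i, c in enumerate(sprite) if c != transparent]
    def draw(screen, offset):
        for i, c in visible:          # no per-pixel transparency test here
            screen[offset + i] = c
    return draw

screen = [1] * 16                     # background colour 1 everywhere
draw = make_sprite_routine([0, 7, 0, 9])
draw(screen, 4)
assert screen[4:8] == [1, 7, 1, 9]    # transparent pixels left untouched
```

  On a real machine you would emit strB/str instructions instead of
  keeping a list, which is where the memory cost per sprite comes from.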

    I bet not everybody will be pleased with my (I hope contagious)
  emphasis on generated code, simply because it' s something which must
  be used carefully: We' ve already seen the problems due to the ARM' s
  pipeline. Similar problems generally arise with cache memory. For the
  ones who don' t know what cache memory is, here comes a small
  explanation. (I assume you' ve read the part dedicated to the pipeline)
    Ram is much slower than processors, so designers thought about
  putting a piece of extra-fast (and extra-expensive) ram in the core of
  CPUs, which is known as cache memory. Also, they wanted the management
  of this ram to be "transparent" to the coder by default. (from my point
  of view a stupidity) As for the pipeline, the cache contains
  instructions which are already in the main (and slow) memory. When
  executing an uncached instruction, the memory manager will search main
  ram for the instruction you want and put it in the cache. (also
  removing one, generally the "oldest")
    This is very convenient when you have a small loop... The manager
  loads all the loop instructions into the cache, and when it finds the
  final 'bNE' it will discover that the loop is already in cache memory
  and can execute it at maximum speed. (no more accesses to main ram to
  read instructions)
    The problem is that some cache memories aren' t write-through. (like
  the pipeline) This means that when you write (modify code for example)
  to an address which is duplicated in cache memory, the data in main ram
  will of course be changed, but not the copy in cache memory. (so we
  have not written through the cache) The result is the main ram and the
  cache containing different values for one and the same address. The
  behaviour is unpredictable and generally disastrous. (a major cause of
  incompatibility in demos and games between ST/Falcon and
  Amiga/Amiga1200)
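
    (To see why a non write-through cache and self-modifying code don' t
  mix, here is a toy model in Python, purely illustrative, of a cache
  that never re-reads a line it already holds:)

```python
# Toy model of the write-back danger: code is modified in main ram,
# while the instruction cache still returns the stale copy.
class SimpleCache:
    def __init__(self, ram):
        self.ram, self.lines = ram, {}
    def fetch(self, addr):                 # instruction fetch goes via cache
        if addr not in self.lines:
            self.lines[addr] = self.ram[addr]
        return self.lines[addr]

ram = {100: "add r1,r1,r2"}
cache = SimpleCache(ram)
cache.fetch(100)                  # the instruction is now cached
ram[100] = "add r1,r1,r3"         # a self-modifying write reaches ram only...
print(cache.fetch(100))           # ...the CPU still executes the old opcode
```

  A write-through cache would update (or invalidate) the cached line on
  the store, which is why the Archimedes gets away with it.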
    Fortunately, the ARM processors' cache is write-through! So generated
  code and automodified code don' t matter on an Archimedes, as long as
  you take care of pipelining. What good news, it really demonstrates how
  well designed the ARM processor is, and certainly makes you wonder why
  I called this chapter red alert!
    Quite simply, ARM made a joint venture with Digital (sold its soul to
  the devil?) in order to create a high frequency ARM, called the
  StrongARM. Amongst other features, this processor has a deeper pipeline
  and a non write-through instruction cache. Normally the deeper pipeline
  would make self-modified code crash, but I heard it' s not the case,
  due to some fancy pipeline handling. The problem of the instruction
  cache memory remains.
    So what? Shall we leave generated code behind in order to make
  proggys run on StrongArm? Surely not, and the other computers (Amiga,
  Falcon) have proven it is possible to make generated code with a cache.
  The first solution is to turn the cache off. =) I bet you won' t be
  very excited by this one, since it means losing all the speed inherent
  to cache memory. The second one, and this time it' s no joke, is to
  control the cache ourselves. If you remember, I said processor
  designers wanted the cache management to be transparent to coders, but
  I bet there are some instructions which will allow us to control the
  cache. (at least it is the case on the 680x0) And if this is feasible
  (I think so, but I can' t promise since I don' t have the docs... I own
  an Arm2 after all), you' ll be able to get the maximum power out of
  your ARM, because you are the one who knows which parts of the code
  need to be cached and which don' t. (if you don' t agree, does this
  mean a chip (MemC?) is cleverer than you?)
    Ok, the article must come to an end now... If you are a skilled
  coder and know everything about cache memory, what about writing an
  article about it, so that everybody will be able to code demos working
  on every Archie computer, even StrongARMed ones, and at last show the
  PC users what real power is?

                          Having knowledge is good,
                           What about sharing it?

                                 - THE END -


    I spoke about processors with cache memory in the article dealing
  with generated code, but only considered the danger for StrongARMed
  RiscPCs. Time now to tell more about this fast piece of memory, and
  I' ll give you a little trick I' m quite proud of.
    When you perform a ldr[B] in a 'cacheable' part of memory, the
  processor will in fact load the whole line (=4 longs) containing this
  data into the cache memory. Then, if you make further accesses to this
  data line, it will remember that it already has it in cache memory, and
  so won' t need further accesses to main ram. This is very valuable when
  you access consecutive data, or when you have frequent accesses to the
  same data line.
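
    (You can estimate the wasted traffic yourself. A Python sketch, with
  16-byte lines as implied by the "4 longs" above; the access patterns
  are invented examples:)

```python
# Ram traffic caused by a 16-byte cache line: every new line touched
# loads 4 longwords, even if only one byte of it is needed.
LINE = 16
def bytes_fetched(addresses):
    # One full line is read from main ram per distinct line touched.
    return len({a // LINE for a in addresses}) * LINE

sequential = list(range(64))             # 64 consecutive bytes
scattered  = [i * 40 for i in range(64)] # 64 distant texels (rotozoom-style)
print(bytes_fetched(sequential))         # 4 lines, nothing wasted
print(bytes_fetched(scattered))          # one full line per texel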
    Sometimes (rotozooms, texture mapping...) you won' t need consecutive
  data, but distant data instead. And if your matrix/texture or whatever
  is in a 'cacheable' part of memory, even if you need a single byte, the
  cache will be loaded with the full data line. 4 longs read from main
  ram where only one byte is needed, what a waste. The solution lies in
  the ability to have memory parts marked as 'uncacheable', meaning that
  all read accesses will load the needed data into a register as
  expected, and the cache won' t be fed with the data line.
    So now, how can we declare a part of memory as uncacheable? According
  to the documentation, this is very (VERY) tedious, since you have to
  remap your data to a logical memory section (?00000-?fffff) which is
  marked as uncacheable. So you need to control both the cache and the
  memory management unit, (MMU) and fooling with this might prove
  dangerous. (RiscOS is the one normally meant to be in control) Anyway,
  handling it seemed so boring that I think I wouldn' t have put my hands
  on it.
    And then light came to my mind. I guessed the video memory was
  declared by the system as an uncacheable area... And after some tests
  on ArmOric' s RiscPC we discovered my guess was true. So, no more need
  to bang your head against the wall with all this damn! (yeah) memory
  management: you simply need to allocate more screen memory for your
  prog, and put all your critical data in it.

;*****                                                                  *****
;*****                            APPENDIX A                            *****
;*****                                                                  *****
; Well, for once it is BASIC code, just because I wanted to ridicule it.
; And well, not everybody has a copy of ExtASM. But the ExtASM version
; of it can be found in the source of the 'IntroCR'.

REM==== ASM CODE ============================================================
DIMc 1024*4
[opt o
; ---------------------------------------------------------------------------
; ---                     Routine drawing a 8 bpp box                     ---
; ---                         (C) Alain BROBECKER                  May 96 ---
; ---------------------------------------------------------------------------
; * This routine draws the box between (x1;y1) and (x2;y2) on the mode13
; screen with the given filling pattern.
; * r13 is saved just after the generated code, so we can use it for the
; ldmia-stmia copy. But we have to generate the instructions which will
; restore it at the end of routine.
; * The last filling instruction (str, strB or add, to modify the address)
; is generated with the stmia used for the endcode generation.
; * Most times (>75%) we won' t need an add to modify offsets, so I chose
; to branch in such cases instead of the cases where we have a str(B).
; * There are many other tricks (for the generation of 'bGE'..) but I won' t
; explain them since the code is widely commented.
; Parameters are...
;     r0 = screen address.
;     r1 = filling pattern.
;     r2 = x1.    1------+
;     r3 = y1.    |      |
;     r4 = x2.    |      |
;     r5 = y2.    +------2
  cmp       r2,#320                 ; At first check if the box is
  cmpLT     r3,#256                 ;   completely out of screen, and in
  movGE     pc,r14                  ;   such a case we quit.
  cmp       r4,#0
  cmpGE     r5,#0
  movLT     pc,r14
  stmfd     r13!,{r0-r12,r14}       ; Be clean or die.
  cmp       r2,#0                   ; Perform the clipping.
  movLT     r2,#0
  cmp       r3,#0
  movLT     r3,#0
  cmp       r4,#320
  movGE     r4,#255
  addGE     r4,r4,#(319-255)
  cmp       r5,#256
  movGE     r5,#255
  subS      r14,r5,r3               ; r14=dy=y2-y1.
  subGES    r5,r4,r2                ; r5=dx=x2-x1.
  ldmMIfd   r13!,{r0-r12,pc}        ; Quit if dy<0 or dx<0.
  add       r3,r3,r3,lsl #2         ; r3=y1*5.
  add       r0,r0,r3,lsl #6         ; r0=screen+y1*320.
  add       r0,r0,r2                ; r0=screen+y1*320+x1.
  mov       r3,r2,lsr #2            ; r3=x1/4.
  rsbS      r3,r3,r4,lsr #2         ; r3=x2/4-x1/4=nb_longs.
  adr       r7,_small_adr           ; r7 points at addresses for small boxes.
  ldrEQ     pc,[r7,r5,lsl #2]       ; nb_longs=0, then execute small box rout.
  rsb       r5,r5,#255              ; r5=319-dx=nb of bytes to pass each
  add       r5,r5,#(319-255)        ;   line. (319 isn' t a valid imm8r)
  adr       r6,_code+4              ; Generate code here.
  ldmdb     r7!,{r8-r11}            ; Load some opcodes.
; Here we begin to care about first longword filling.
  andS      r2,r2,#%11              ; r2=x1 mod 4.
  bEQ       _first_long_full        ; If x1 mod 4=0, first long is full.
  subNE     r3,r3,#1                ; Else first long mustn' t be drawn.
  tst       r2,#%01                 ; Down bit of x1 set?
  strNE     r8,[r6],#4              ; Then we have an odd nb of strB.
  tst       r2,r2,lsl #1            ; bit1 and bit0 of x1 both set?
  strEQ     r8,[r6],#4              ; No, then x1 mod 4=1 or 2, so we must
  strEQ     r8,[r6],#4              ;   generate two strB more.
._first_long_full
  and       r4,r4,#%11              ; r4=x2 mod 4.
  and       r2,r4,r4,lsr #1         ; r2=bit1 and bit0 of x2 mod 4.
  addS      r3,r3,r2                ; If x2 mod 4=%11, last long is full.
  bEQ       _last_longword          ; If nb_longs=0 go to last long.
._one_stmia_max
  subS      r3,r3,#13               ; More than 13 longs left?
  strGE     r9,[r6],#4              ; Yes, then save one stmia max.
  bGT       _one_stmia_max          ; r3>0? Then test again.
  ldrMI     r9,[r7,r3,lsl #2]       ; If r3<0 then load opcode of last long
  strMI     r9,[r6],#4              ;   fill instruction and save it.
._last_longword
  teq       r4,#%11                 ; x2 mod 4=%11?
  bEQ       _last_long_full         ; Then last long is full.
  tst       r4,#%01                 ; Down bit clear?
  strEQ     r8,[r6],#4              ; Then we have an odd nb of strB.
  teq       r4,r4,lsl #1            ; bit1 eor bit0<>0?
  strNE     r8,[r6],#4              ; Then x2 mod 4=1 or 2, so there
  strNE     r8,[r6],#4              ;   are two strB more.
._last_long_full
  ldr       r8,[r6,#-4]!            ; Load last saved instruction.
  tst       r8,#1<<26               ; Is it a str or strB?
  bEQ       _generate_add           ; No, then we' ll need an add.
  add       r8,r8,r5                ; Yes, then add offset 319-dx to it.
._generate_endcode
  adr       r9,_code-3*4            ; Beware the pipeline and last instruction.
  sub       r9,r9,r6                ; r9=offset for the bGE.
  mov       r9,r9,asr #2            ; r9=offset/4. (higher byte=&ff)
  eor       r9,r9,#&55<<24          ; r9=&AAxxxxxx='bGE offset'.
  stmia     r6!,{r8-r11,r13}        ; Save instructions and stack.
  mov       r2,r1                   ; Put pattern in other longwords.
  mov       r3,r1
  mov       r4,r1
  mov       r5,r1
  mov       r6,r1
  mov       r7,r1
  mov       r8,r1
  mov       r9,r1
  mov       r10,r1
  mov       r11,r1
  mov       r12,r1
  mov       r13,r1
._code
  subS      r14,r14,#1              ; One line will be drawn.
  dcd 0:dcd 0:dcd 0                 ; Space for code and stack.
  dcd 0:dcd 0:dcd 0:dcd 0:dcd 0:dcd 0
  dcd 0:dcd 0:dcd 0
  dcd 0:dcd 0:dcd 0:dcd 0

._generate_add
  cmp       r5,#0                   ; Offset is null?
  bEQ       _generate_endcode       ; Then go on...
  add       r6,r6,#4                ; Don' t modify loaded instruction.
  mov       r8,#&e2<<24             ; r8=opcode of 'add r0,r0,#0'.
  add       r8,r8,#&8<<20
  cmp       r5,#255                 ; Offset bigger than 255?
  addGE     r2,r8,#255              ; Then generate an 'add r0,r0,#255'
  strGE     r2,[r6],#4
  subGE     r5,r5,#255              ;   and substract 255 to offset.
  add       r8,r8,r5                ; r8='add r0,r0,#offset'.
  b         _generate_endcode

  str       r1,[r0],#4              ; Opcodes for last longs filling.
  stmia     r0!,{r1-r2}
  stmia     r0!,{r1-r3}
  stmia     r0!,{r1-r4}
  stmia     r0!,{r1-r5}
  stmia     r0!,{r1-r6}
  stmia     r0!,{r1-r7}
  stmia     r0!,{r1-r8}
  stmia     r0!,{r1-r9}
  stmia     r0!,{r1-r10}
  stmia     r0!,{r1-r11}
  stmia     r0!,{r1-r12}
  strB      r1,[r0],#1              ; Byte filling instruction.
  stmia     r0!,{r1-r13}            ; Maximum filling instruction.
  ldr       r13,[pc,#0]             ; Load stack which is 8 bytes after.
  ldmfd     r13!,{r0-r12,pc}        ; And quit.
._small_adr
  dcd       _small1                 ; Addresses for the routines corresponding
  dcd       _small2                 ;   to x1-x2 being in the same longword.
  dcd       _small3
  dcd       _small4

; Here are the routines for small boxes.
._small1
  strB      r1,[r0],#320
  subS      r14,r14,#1
  bGE       _small1
  ldmfd     r13!,{r0-r12,pc}

._small2
  strB      r1,[r0],#1
  strB      r1,[r0],#319
  subS      r14,r14,#1
  bGE       _small2
  ldmfd     r13!,{r0-r12,pc}

._small3
  strB      r1,[r0],#1
  strB      r1,[r0],#1
  strB      r1,[r0],#318
  subS      r14,r14,#1
  bGE       _small3
  ldmfd     r13!,{r0-r12,pc}

._small4
  str       r1,[r0],#320
  subS      r14,r14,#1
  bGE       _small4
  ldmfd     r13!,{r0-r12,pc}
REM==== BASIC CODE ==========================================================
DIMv 8:!v=148:v!4=-1:SYS"OS_ReadVduVariables",v,v:v=!v
PRINT"Press Mouse to begin test"
REM Testing Rectangle Fill.
REM Testing my routine.
REM The results.
PRINT"To draw 2048 such rectangles:"
PRINT"  Basic RECTANGLE FILL takes ";t1/100;" sec."
PRINT"  My routine takes ";t2/100;" sec."
PRINT"  My rout is " t1/t2 " times faster..."