Tricks for bits/register/matrix operations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ by Alain Brobecker (aka baah/Arm's Tech) (abrobecker@yahoo.com or rte de Dardagny; 01630 CHALLEX; FRANCE) -------------------------------------------------------------------------------- Purpose: Swapping two registers From: Sylvain Langlade (aka Nullos/DNT Crew) Many computer science teacher think you need a temporary variable to exchange two variables. The following will be handy to mock them, and is also usefull when you code in assembly and have no register left. It uses the fact that bitwise operations (and, or..) are commutative and associative. eor m0,m0,m1 ;m0=aEORb eor m1,m0,m1 ;m1=(aEORb)EORb=a eor m0,m0,m1 ;m0=(aEORb)EORa=b -------------------------------------------------------------------------------- Purpose: Swapping two registers (2) From: Alain Brobecker (aka baah/Arm's Tech) And with help by Robert Harley (ThanX =) This one is easier to recall to mind, and don't use extra register either. add m0,m0,m1 ;m0=a+b sub m1,m0,m1 ;m1=a+b-b=a sub m0,m0,m1 ;m0=a+b-a=b It always work, even for values excessing the add/sub precision, because those arithmetic operations are precise modulo 2^32 on a 32 bit processor. -------------------------------------------------------------------------------- Purpose: Reversing a register From: Alain Brobecker (aka baah/Arm's Tech) The main problem is to reverse the bits of a given register, but to keep the code short I will suppose we have an 8 bits register. Adaptation to 16, 32 or more bits is quite straightforward, so I leave it to you. Suppose we want to reverse an 8 bits register m0=b7b6b5b4b3b2b1b0. A simple method is to test each bit independantly and set their opposite accordingly. It is done just below, uses 2*8+1 instructions and no extra registers. Using it on a register of n bits will need 2*n+1 instructions. mov m1,#0 ;All bits of m1 set to 0 #SET N=0 #REPT 8 tst m0,#1<0 #SET N=N+1 #ENDR We can notice that bN and bN+(8/2) are distant of 3 bits (counting left to right or right to left with wrapping) both in original and reversed registers. So we can use the following which is only 7+1 instructions long, at the expense of an extra register (1 instruction for m2 initialisation). For a register of 2*n bits this method will require 1+2*(n-1)+1=2*n instructions. mov m2,#%10001000 and m1,m2,m0,ror #1 ;m1=b0...b4... #SET N=1 #REPT 3 and m3,m2,m0,ror #N+1 ;m3=bN...bN+4... add m1,m1,m3,lsr #N ;m1+=bN...bN+4...>>N #SET N=N+1 #ENDR Fine, but I give you a method which is 6+2 instructions long, and will prove significantly faster when applied to bigger registers. For a 2^n bits register it will need (n-1) extra register and 3*(n-1)+(n-1) instructions. mov m2,#%10101010 mov m3,#%11001100 and m1,m2,m0,lsl #1 ;m1=b6.b4.b2.b0. and m0,m2,m0 ;m0=b7.b5.b3.b1. orr m0,m1,m0,lsr #1 ;m0=b6b7b4b5b2b3b0b1 and m1,m3,m0,ror #2 ;m1=b0b1..b4b5.. and m0,m3,m0,ror #4 ;m0=b2b3..b6b7.. orr m1,m1,m0,lsr #2 ;m1=b0b1b2b3b4b5b6b7 -------------------------------------------------------------------------------- Purpose: Reversing a byte (=Reversing a register 2) From: Vincent Lefevre This one uses 8 instructions and only 2 extra registers to reverse a byte. You'll be able to use it for reversing a 16 bits word, but not a whole 32 bits register (or maybe with an eor trick?). It will be faster than the one above as long as you don't have many bytes/words to reverse. orr m0,m0,lsl #8 ;m0=b7b6b5b4b3b2b1b0b7b6b5b4b3b2b1b0 and m1,m0,#&330 ;m1=..b1b0..b5b4.... and m2,m0,#&cc0 ;m2=b3b2..b7b6...... orr m0,m1,m2,lsr #4 ;m0=b1b0b3b2b5b4b7b6.. and m1,m0,#&154 ;m1=.b0.b2.b4.b6.. and m2,m0,#&2a8 ;m2=b1.b3.b5.b7... orr m0,m1,m2,lsr #2 ;m0=b0b1b2b3b4b5b6b7. mov m0,m0,lsr #1 ;m0=b0b1b2b3b4b5b6b7 -------------------------------------------------------------------------------- Purpose: Fast 90 degrees Rotation From: Alain Brobecker (aka baah/Arm's Tech) (Devised during a dull numerical analysis course =) Supposing datas are bytes, by slicing the whole matrix/texture into smaller sets, we can reduce our problem to rotation of 4*4 matrixs which will fit in four 32 bits registers. Of course this can be adapted for other data sets, such as 8*8 matrixs of nibbles (4 bits). I won't give related code here, only show what the matrix operations are. So our problem is to do: 1st long - a1a2a3a4 \ a4b4c4d4 2nd - b1b2b3b4 --------\ a3b3c3d3 3rd - c1c2c3c4 --------/ a2b2c2d2 4th - d1d2d3d4 / a1b1c1d1 My first solution is quite similar to the one used for reversing a register. An underscore in front of a longword means it was already in previous dataset and so doen' t need to be computed, but I copy it for clarity. This method needs 8+4+4+4=20 operations and 8+2 registers. With a reorganistion of the operations we can reduce to just 6+2 registers. When working with nibbles it will use 64 instructions to rotate the 8*8 matrix. a1.a3. a1b1.. .a2.a4 a2b2.. a1a2a3a4 b1.b3. a1b1a3b3 ..c3d3 a1b1c1d1 b1b2b3b4 .b2.b4 a2b2a4b4 ..c4d4 a2b2c2d2 c1c2c3c4 c1.c3. c1d1c3d3 _a1b1a3b3 a3b3c3d3 d1d2d3d4 .c2.c4 c2d2c4d4 _a2b2a4b4 a4b4c4d4 d1.d3. _c1d1c3d3 .d2.d4 _c2d2c4d4 The second method I found is more tricky (my mind shall be messy =). It requires 6+2+2+4+4=18 operations, the same amount of registers, and only 56 operations for 8*8 matrix (the EOR trick is then used twice). Using the EORs spares us the isolation sequence of two longwords, but does the same since it is bitwise oriented, associative and commutative. a1..a3.. a1a2a3a4 b1..b3.. a1b1a3b3 \ b1b2b3b4 (1) = a2a3a4a1 EOR b1b2b3b4 _(1) ---\ c1c2c3c4 c1..c3.. c1d1c3d3 ---/ d1d2d3d4 d1..d3.. _(2) / (2) = c2c3c4c1 EOR d1d2d3d4 _a1b1a3b3 a1b1.. \ _a1b1a3b3 _c1d1c3d3 a1b1c1d1 ---\ a2b2a4b4 = b1a3b3a1 EOR (1) ..c3d3 a3b3c3d3 ---/ _c1d1c3d3 _a2b2a4b4 a2b2c2d2 / c2d2c4d4 = d1c3d3c1 EOR (2) a2b2.. a4b4c4d4 _c2d2c4d4 ..c4d4 -------------------------------------------------------------------------------- Purpose: Find position of the first set bit in a register From: Alain Brobecker (aka baah/Arm's Tech) To find the first set bit of a register m0, namely INT(log(m0)/log(2)), I use a dichotomic approach. Assuming the register m0 is 2^5 bits long, the position can be found with the following sequence. The unsigned Higher or Same (HS) condition code is equivalent to Carry Set (CS). Uses 3*n instructions on a 2^n bits register. mov m1,#0 ;m1 will contain position of first set bit ]:FORc%=4TO1STEP-1:n%=2^c%:[opt opt% cmp m0,#1<>n% ]:NEXTc%:[opt opt% cmp m0,#2 addHS m1,m1,#1 -------------------------------------------------------------------------------- Purpose: Reversing endians From: Olly Betts We have m0=abcd where a-d are bytes, and we want to have m0=dcba. This first method will be handy if you have to perform the operation many times in a loop, it will take 1+3n cycles to reverse n longwords (without loop etc...) but uses two extra registers. mvn m2,#&ff00 ;m2=&ffff00ff ;The following goes in the loop eor m1,m0,m0,ror #16 ;m1=aEORc bEORd aEORc bEORd and m1,m2,m1,lsr #8 ;m1=0 aEORc 0 aEORc eor m0,m1,m0,ror #8 ;m0=dabc EOR m1=dcba -------------------------------------------------------------------------------- Purpose: Reversing endians (2) From: Tom Hughes This one is a slightly different version from the above one, which uses only one extra register, and requires 4n cycles, which is as fast if we reverse only one longword (ie n=1). eor m1,m0,m0,ror #16 ;m1=aEORc bEORd aEORc bEORd bic m1,m1,#&ff0000 ;m1=aEORc 0 aEORc bEORd mov m0,m0,ror #8 ;m0=dabc eor m0,m0,m1,lsr #8 ;m0=dcba -------------------------------------------------------------------------------- Purpose: An ultra-fast CRC-like routine From: Kostas Proitsakis (aka GUS/Arm's Tech) The following is what I use when speed is critical and CRCs are needed. => block = pointer to byte block size = size of block temp = temp reg <= crc = pseudo CRC value block = preserved size = 0 temp = corrupted mov crc,#0 .loop subs size,size,#1 ldrgeb temp,[2,3] addge crc,crc,size,ror temp bgt loop -------------------------------------------------------------------------------- Purpose: Create the mask of a sprite on the fly From: Kostas Proitsakis (aka GUS/Arm's Tech) The following creates mask for colour 0 on the fly. => d = a reg containing sprite data bm = &80808080 (for 8bit mode), &88888888 (for 4bit mode) <= d = preserved bm = preserved m = mask orr m,d,d,lsr#1 orr m,m,m,lsr#2 orr m,m,m,lsr#4 ;only for 8bit mode and m,m,bm rsb m,m,m,lsl#8 ;4 in 4bit mode -------------------------------------------------------------------------------- Purpose: Load an unaligned longword From: David McQuillan Corrected (wrong lsl/lsr) and modified by Alain Brobecker The following loads into m1 the unaligned longword found at position m0. David was using the couple "ldrEQ m1,[m0]" and "ldmNEia m0,{m1,m2}" but it uses one instruction more, and probably no time less. andS m3,m0,#3 ;get low bits and set Z flag if 0 ldmia m0,{m1,m2} ;load unaligned words movNE m3,m3,lsl #3 ;shift=low bits*8 movNE m1,m1,lsr m3 ;if unaligned, shift to right position rsbNE m3,m3,#32 orrNE m1,m1,m2,lsl m3 ;shift it, and or with first word -------------------------------------------------------------------------------- Purpose: Searching the middle value out of three From: Alain Brobecker (aka baah/Arm's Tech) (because ArmOric needed that one =) Suppose we have three values in m0, m1, m2 and we want the middle value, ie the value which would be in second position if the values were sorted. To do this i search for max(min(a;b);min(max(a;b);c)). subS m3,m0,m1 ;m3=a-b addGT m1,m3,m1 ;m1=max(a;b) subGT m0,m1,m3 ;m0=min(a;b) cmp m1,m2 ;flags=max(a;b)-c movGT m1,m2 ;m1=min(max(a;b);c) cmp m0,m1 ;flags=min(a;b)-min(max(a;b);c) movGT m1,m0 ;m1=max(min(a;b);min(max(a;b);c)) -------------------------------------------------------------------------------- Purpose: Mixing min|max values From: Alain Brobecker (aka baah/Arm's Tech) (because ArmOric needed that one too =) We have m0=max0|min0 (16 bits each) and m1=max1|min1, and we want the min|max for the whole set of values, in the same 16 bits format. cmp m0,m1 movGE m2,m1,lsl #16 movLT m2,m0,lsl #16 movLT m0,m1 ;here m0=max|min?, and m2=other min cmp m2,m0,lsl #16 addLT m2,m2,m0,lsr #16 movLT m0,m2,ror #16 -------------------------------------------------------------------------------- Purpose: Computing approx value of 1/(2^n-1) From: James Blinn We know that dividing by 2^n is the same as making an arithmetic shift right by n bits. Also, dividing by 2^n-1 can be easily computed knowing that 1/(2^n-1) ~ (2^n+1)/(2^2n), because (2^n-1)*(2^n+1)=2^2n-1, ie we have: 1 ----- = 0,0...010...010.. (n-1 zeroes between the ones) n

2 - 1 -------------------------------------------------------------------------------- Purpose: Aligning adress on the nearest 2^n value From: Alain Brobecker (aka baah/Arm's Tech) Suppose we want to align an adress on the nearest 2^n value. For example with 2^2=4 (yep, a longword ;) we have %000+3=%011; %001+3=%100; %010+3=%101 and %011+3=%110. Then we can use the following sequence: add m0,m0,#2^n-1 bic m0,m0,#2^n-1