Performance Improvements: A Case Study Preliminary Version |
initial writing: 2008-07-15 last updated: 2008-10-06 |
Contents
0. Basic Properties of Boolean Arrays
1. Initial Ideas
2. Key Idea
3. Implementation
4. Testing
5. Functional Interface
0. Basic Properties of Boolean Arrays
Dyalog APL/W Version 12.0.1 clear ws ⎕io←0 )copy t randbool timer randbool 3 5 1 0 0 1 1 1 0 0 1 0 0 0 0 0 1 x←randbool 8e5 8 ⋄ 100 timer '≠⌿x' 0.03422 x←randbool 4e5 16 ⋄ 100 timer '≠⌿x' 0.01766 x←randbool 2e5 32 ⋄ 100 timer '≠⌿x' 0.00906 x←randbool 1e5 64 ⋄ 100 timer '≠⌿x' 0.00516 6.4e6 ÷ 14 457142.8571 x←randbool 457143 14 ⋄ 100 timer '≠⌿x' 0.04985
The goal is to speed up f⌿x where x is an n,c boolean array with c not a multiple of 8.
1. Initial Ideas
Idea 0: pad each row up to next byte(≠⌿x) ≡ 14 ⍴ ≠⌿ 457143 16 ↑ x 1 100 timer '14 ⍴ ≠⌿ 457143 16 ↑ x' 0.16657Idea 1: use integers
(≠⌿x) ≡ 2|+⌿x 1 100 timer '2|+⌿x' 0.07031
2. Key Idea
Idea 2: treat argument as if it had c∧8 number of columns⎕←t←17 14⍴⍳14 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14∧8 56 56 ÷ 14 4 ⎕←t1←5 56⍴t⍪7 14⍴0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +⌿t1 0 5 10 15 20 25 30 35 40 45 50 55 60 65 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 4 14⍴+⌿t1 0 5 10 15 20 25 30 35 40 45 50 55 60 65 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 +⌿ 4 14 ⍴ +⌿t1 0 17 34 51 68 85 102 119 136 153 170 187 204 221 +⌿t 0 17 34 51 68 85 102 119 136 153 170 187 204 221 457143 ÷ 4 114285.75 (≠⌿x) ≡ ≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0 1 100 timer '≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0' 0.00765 0.04985 ÷ 0.00765 6.516339869
In general:
slash0←{ c←¯1↑⍴⍵ ⍝ # columns m←8∧c ⍝ LCM d←m÷c ⍝ req'd multiple of # columns n←d×⌈(1↑⍴⍵)÷d ⍝ round up to next multiple of d ⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0 } (≠⌿x) ≡ ≠ slash0 x 1 (=⌿x) ≡ = slash0 x 1 (∧⌿x) ≡ ∧ slash0 x 1 (∨⌿x) ≡ ∨ slash0 x 1 (+⌿x) ≡ + slash0 x 1 100 timer '≠ slash0 x' 0.00766 ÷/ 100 timer¨ '≠⌿x' '≠ slash0 x' 7.211062591 ÷/ 100 timer¨ '=⌿x' '= slash0 x' 7.270666667 ÷/ 100 timer¨ '∧⌿x' '∧ slash0 x' 6.910987483 ÷/ 100 timer¨ '∨⌿x' '∨ slash0 x' 6.974965229 ÷/ 100 timer¨ '+⌿x' '+ slash0 x' 3.094653812
3. Implementation
Dyalog 12.1.x x←randbool 457143 14 100 timer '≠⌿x' 0.00204 0.04985 ÷ 0.00204 24.43627451J6.02
timer=: 6!:2 x=: ? 457143 14 $ 2 100 timer '~:/x' 0.00477299
APL should be faster than J, because APL has bit booleans whereas J has byte booleans.
4. Testing
The expressive power of APL and the computational power of modern machines facilitate thorough testing.
)copy t assert ⎕cr 'assert' assert x 'assertion failure'⎕SIGNAL(0∊x)⍴11 assert 0=2|2 4 6 8 10 assert 0=2|2 4 6 8 11 assertion failure assert 0=2|2 4 6 8 11 ∧ nereduce0←{2|+⌿2+⍵} assert 1 29 113∘.{(≠⌿x)≡nereduce0+x←randbool 137,⍵}1+⍳20
5. Functional Interface
For an interpreter with a functional interface, translating from slash0 to C takes a few minutes:
slash0←{ c←¯1↑⍴⍵ ⍝ number of columns m←8∧c ⍝ LCM d←m÷c ⍝ req'd multiple of # columns n←d×⌈(1↑⍴⍵)÷d ⍝ round up to next multiple of m ⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0 } #include "globals.h" static I gcd[8]={8,1,2,1,4,1,2,1}; F1(neslash0){I c,m,n,*s; s=y->s; // shape c=s[1]; // # columns m=8*c/gcd[c%8]; // LCM(8,c) d=m/c; // req‘d multiple of # columns n=d*((*s+d-1)/d); // round up to next multiple of m return neslash(reshape(v2(d,c),neslash(reshape(v2(n/d,m),cat0(y,reshape(v2(n-*s,c),zero)))))); }globals.h
typedef long I; typedef struct{I t,c,n,r,s[1];}*A; #define F1(f) A f(A y) #define F2(f) A f(A x,A y); extern A zero; extern F1(neslash); extern F2(cat0); extern F2(reshape); extern A v2(I,I);
This implementation would have obtained nearly all of the speed-up from the key idea.
Dyalog 12.1.x 100 timer '≠ slash0 x' 0.00266Timings:
0.04985 | v12.0 ≠⌿x | |
0.00766 | v12.0 ≠ slash0 x | |
0.00204 | v12.1 ≠⌿x | |
0.00266 | v12.1 ≠ slash0 x | |
0.00477 | J6.02 |