| Performance Improvements: A Case Study Preliminary Version |
initial writing: 2008-07-15 last updated: 2008-10-06 |
Contents
0. Basic Properties of Boolean Arrays
1. Initial Ideas
2. Key Idea
3. Implementation
4. Testing
5. Functional Interface
0. Basic Properties of Boolean Arrays
Dyalog APL/W Version 12.0.1
clear ws
⎕io←0
)copy t randbool timer
randbool 3 5
1 0 0 1 1
1 0 0 1 0
0 0 0 0 1
x←randbool 8e5 8 ⋄ 100 timer '≠⌿x'
0.03422
x←randbool 4e5 16 ⋄ 100 timer '≠⌿x'
0.01766
x←randbool 2e5 32 ⋄ 100 timer '≠⌿x'
0.00906
x←randbool 1e5 64 ⋄ 100 timer '≠⌿x'
0.00516
6.4e6 ÷ 14
457142.8571
x←randbool 457143 14 ⋄ 100 timer '≠⌿x'
0.04985
The goal is to speed up f⌿x where x is an n,c boolean array with c not a multiple of 8.
1. Initial Ideas
Idea 0: pad each row up to next byte
(≠⌿x) ≡ 14 ⍴ ≠⌿ 457143 16 ↑ x
1
100 timer '14 ⍴ ≠⌿ 457143 16 ↑ x'
0.16657
Idea 1: use integers
(≠⌿x) ≡ 2|+⌿x
1
100 timer '2|+⌿x'
0.07031
2. Key Idea
Idea 2: treat argument as if it had c∧8 number of columns
⎕←t←17 14⍴⍳14
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13
14∧8
56
56 ÷ 14
4
⎕←t1←5 56⍴t⍪7 14⍴0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 2 3 4 5 6 7 8 9 10 11 12 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52 0 4 8 12 16 20 24 28 32 36 40 44 48 52
4 14⍴+⌿t1
0 5 10 15 20 25 30 35 40 45 50 55 60 65
0 4 8 12 16 20 24 28 32 36 40 44 48 52
0 4 8 12 16 20 24 28 32 36 40 44 48 52
0 4 8 12 16 20 24 28 32 36 40 44 48 52
+⌿ 4 14 ⍴ +⌿t1
0 17 34 51 68 85 102 119 136 153 170 187 204 221
+⌿t
0 17 34 51 68 85 102 119 136 153 170 187 204 221
457143 ÷ 4
114285.75
(≠⌿x) ≡ ≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0
1
100 timer '≠⌿ 4 14 ⍴ ≠⌿ 114286 56 ⍴ x⍪7 14⍴0'
0.00765
0.04985 ÷ 0.00765
6.516339869
In general:
slash0←{
c←¯1↑⍴⍵ ⍝ # columns
m←8∧c ⍝ LCM
d←m÷c ⍝ req'd multiple of # columns
n←d×⌈(1↑⍴⍵)÷d ⍝ round up to next multiple of d
⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0
}
(≠⌿x) ≡ ≠ slash0 x
1
(=⌿x) ≡ = slash0 x
1
(∧⌿x) ≡ ∧ slash0 x
1
(∨⌿x) ≡ ∨ slash0 x
1
(+⌿x) ≡ + slash0 x
1
100 timer '≠ slash0 x'
0.00766
÷/ 100 timer¨ '≠⌿x' '≠ slash0 x'
7.211062591
÷/ 100 timer¨ '=⌿x' '= slash0 x'
7.270666667
÷/ 100 timer¨ '∧⌿x' '∧ slash0 x'
6.910987483
÷/ 100 timer¨ '∨⌿x' '∨ slash0 x'
6.974965229
÷/ 100 timer¨ '+⌿x' '+ slash0 x'
3.094653812
3. Implementation
Dyalog 12.1.x
x←randbool 457143 14
100 timer '≠⌿x'
0.00204
0.04985 ÷ 0.00204
24.43627451
J6.02
timer=: 6!:2 x=: ? 457143 14 $ 2 100 timer '~:/x' 0.00477299
APL should be faster than J, because APL has bit booleans whereas J has byte booleans.
4. Testing
The expressive power of APL and the computational power of modern machines facilitate thorough testing.
)copy t assert
⎕cr 'assert'
assert x
'assertion failure'⎕SIGNAL(0∊x)⍴11
assert 0=2|2 4 6 8 10
assert 0=2|2 4 6 8 11
assertion failure
assert 0=2|2 4 6 8 11
∧
nereduce0←{2|+⌿2+⍵}
assert 1 29 113∘.{(≠⌿x)≡nereduce0+x←randbool 137,⍵}1+⍳20
5. Functional Interface
For an interpreter with a functional interface, translating from slash0 to C takes a few minutes:
slash0←{
c←¯1↑⍴⍵ ⍝ number of columns
m←8∧c ⍝ LCM
d←m÷c ⍝ req'd multiple of # columns
n←d×⌈(1↑⍴⍵)÷d ⍝ round up to next multiple of m
⍺⍺⌿(d,c)⍴⍺⍺⌿((n÷d),m)⍴⍵⍪((n-1↑⍴⍵),c)⍴⍺⍺⌿⍳0
}
#include "globals.h"
static I gcd[8]={8,1,2,1,4,1,2,1};
F1(neslash0){I c,m,n,*s;
s=y->s; // shape
c=s[1]; // # columns
m=8*c/gcd[c%8]; // LCM(8,c)
d=m/c; // req‘d multiple of # columns
n=d*((*s+d-1)/d); // round up to next multiple of m
return neslash(reshape(v2(d,c),neslash(reshape(v2(n/d,m),cat0(y,reshape(v2(n-*s,c),zero))))));
}
globals.h
typedef long I;
typedef struct{I t,c,n,r,s[1];}*A;
#define F1(f) A f(A y)
#define F2(f) A f(A x,A y);
extern A zero;
extern F1(neslash);
extern F2(cat0);
extern F2(reshape);
extern A v2(I,I);
This implementation would have obtained nearly all of the speed-up from the key idea.
Dyalog 12.1.x
100 timer '≠ slash0 x'
0.00266
Timings:
| 0.04985 | v12.0 ≠⌿x | |
| 0.00766 | v12.0 ≠ slash0 x | |
| 0.00204 | v12.1 ≠⌿x | |
| 0.00266 | v12.1 ≠ slash0 x | |
| 0.00477 | J6.02 |