The Tuning Pipeline 2014
Roger Hui



0. Range 0

From Primitive Performance, Dyalog ’13.

x←?1e6⍴1000

a. ¯1+{≢⍵}⌸(⍳1000),x     1.00
b. ¯1+{≢⍵}⌸(⍳1+⌈/x),x    1.62
      …

max=*x++; for(i=1;i<n;++i){if(max<*x)max=*x; ++x;}

SSE4 (Intel) or AltiVec (IBM), circa 2008

pminsb pmaxsb  8-bit integers   16 at a time;  factor of 25
pminsw pmaxsw 16-bit integers   8 at a time;  12
pminsd pmaxsd 32-bit integers   4 at a time;  6

At last year’s conference [0] I had a slide illustrating the performance of the new key operator. The only difference between line a and line b was 1000 v 1+⌈/x . I thought about it some more after the conference, and realized that the ratios show that this case of key was really fast. Unbelievable, even. Further analysis led to a novel scalar algorithm for finding the maximum of a vector, faster than the C code shown here; an algorithm that eluded even expert C programmers.

I discussed these findings with the Dyalog developers. Jay Foad pointed out that there are vector instructions [1, p. B4-138] for finding the min and the max, much faster than even the novel scalar algorithm. The factors are larger than one’d expect from the degree of parallelism, because the Intel instruction set (and therefore C) do not provide min and max as primitives for scalars.



1. Range 1

⌊/x 6,12,25
 
⌈/x 6,12,25

The preceding gives us the first set of speed-ups. A factor of 6 for 4-byte integers, 12 for 2-byte integers, and 25 for 1-byte integers.



2. Index-Of: Small v Large Range

    4-Byte Integers

  large v small
range
range-finding
v large-range
14.1 v 14.0
large-range small-range
  x ⍳ y   5.0 0.014 3.2 6.3
  x ⍳ x 2.5 0.011 5.9 6.2
  y ∊ x 5.7 0.018 4.5 9.7
    ∪ x 3.3 0.014 4.1 7.0

If ⌊/ and ⌈/ were the only speed-ups from fast range-finding, then the preceding discussion would not be very significant. However:

For the index-of family of functions, for 4-byte integers, small-range arguments have more efficient algorithms [2]. The first column shows how much faster. The second column shows the cost of determining whether the small-range algorithms are applicable. The two columns show that for an expenditure of between 1 to 2% we can potentially get a factor of 2.5 to 5.7 speed-up. Our experience has been that timing differences less than 5% can not be reliably detected.

The last two columns show the improvement from 14.0 to 14.1.



3. Index-Of: Inverted Tables

From Rank & Friends, Dyalog ’13.

inverted table index-of   {(⍉↑⍺⍳¨⍺)⍳(⍉↑⍺⍳¨⍵)}
implemented in v14.0 as   8⌶
   x ← x0 x1 x2 x3 …
   y ← y0 y1 y2 y3 …
   i ← x (8⌶) y
8⌶ exploits:
•  CRC32 instruction, SSE4.2, circa 2008
•  “selfies”, x(8⌶)x
•  other black arts ☺

Last year, we also talked about inverted table index-of [3]. This useful function was implemented as 8⌶ in Dyalog version 14.0 [4, p. 80]. For one user, 8⌶ runs 30 to 60 times faster than what’s they were using.

The speed of 8⌶ comes from using the CRC32 instruction [1, p. A3-191], from exploiting “selfies”, and from other techniques.



4. Index-Of: Other Improvements

   u←(' ',⎕a,⎕d)[?8e4 30⍴37]
   x←u[?1e5⍴≢u;] ⋄ c←,⊂x
   y←u[?1e5⍴≢u;] ⋄ d←,⊂y

   cmpx 'c(8⌶)d' 'x⍳y'
c(8⌶)d → 9.53e¯3 |    0% ⎕⎕⎕⎕
x⍳y    → 4.56e¯2 | +378% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   cmpx 'c(8⌶)c' 'x⍳x'
c(8⌶)c → 4.05e¯3 |    0% ⎕⎕
x⍳x    → 4.44e¯2 | +996% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

x⍳y is equivalent to c(8⌶)d where c and d are 1-column tables of x and y . We have not yet retrofitted the 8⌶ code back into the ordinarycode. These benchmarks show the amount of speed-up that can be expected. Note that the second benchmark is a selfie, an expression where the left and right arguments are the same.



5. Key: Special Code

Existing
14.0
{f⌿⍵}      ⊢∘(f⌿)   f: + ⌈ ⌊  ∧ ∨ = ≠
{f/⍵}  	   ⊢∘(f/)
{≢⍵}	   ⊢∘≢
{⊂⍵}	   ⊢∘⊂
To Do
{⍺,f⌿⍵}    {⍺(f⌿⍵)}
{⍺,f/⍵} 
{⍺,≢⍵}     {⍺(≢⍵)}
{⍺,(≢∪)⍵}  {⍺((≢∪)⍵)}  {(≢∪)⍵}	
           {⍺⍵}		

The key operator is new to version 14.0 [4, p. 18], and special code enhances its practical usefulness. Now that we have gotten our head around the fact that the operand function applies to both arguments, we realize that more special cases should be implemented.



6. Partition (Pseudo-) Operator

b ⊂[⎕io] x 1.3-42
 
b (f/¨ ⊂) x 30-360
   f: + ⌈ ⌊ ∧ ∨ = ≠
 
b (≢¨ ⊂) x 1.6-50



(≢¨⊂)⍨b   ⍝ partition length
b(≢¨⊂)b
≢¨b⊂b
-2-/(b/⍳≢b),⎕io+≢b
(1↓t,⎕io+≢b)-t←b/⍳≢b

We sped up partition because it plays a key role in the general case of the key operator. In doing the work, we noticed that the implemented logic does almost all required of a partition operator [6].

The last expression computes the partition lengths. It almost as short as the shortest alternative and now it is faster as well.



7. Boolean

b[i] 5.2
 
b[i]←0, b[i]←1 1.2-3.5
 
b +.f c 3
  f: × < ≤ = ≠ ≥ > ∧ ∨ ⍲ ⍱
 
m∧.=v, m≡⍤1⊢v, v≡⍤1⊢m 7
 
⍋b, ⍒b 1-1.7

The speed-ups to boolean indexing and boolean index assignment are due in part to the fast range-finding. You may have wondered earlier about finding the min and max of 1- or 2-byte integers. What use is that? Well, fast-ranging finding is just the thing for checking index bounds, and on modern machines separating out the index checking from the main processing, speeds up the main processing. (Cf. branch prediction.)

The boolean inner products compose the POPCNT instruction [1, p. B4-189], which counts the number of 1-bits in a word, with the boolean function.



8. Scalar Arithmetic Functions

÷x    (1÷x) 1.06
 
-x    (0-x) 2.2
 
x*2   (x×x) 4.2
 
x÷c   (x×÷c) 3.8


   ⎕←x←○23
72.2566
   (x÷c) - x×÷c←100
¯1.11022e¯16

These speed-ups were proposed by Eugene Ying in past conferences [7]. The last one is ruled out because we don’t want give different results for a function as important as division or multiplication.



9. Miscellaneous

⎕dr x 1-3.8
 
(≢∪)x 2.8-
 
2 f/ x 2
 
↑x 4-10
 
⊢/x, ⊣/x 3-28
 
x ⍴ scalar 2

⎕dr squeezes data into a smaller datatype. There are knock-on benefits because algorithms on smaller datatypes are often faster.

(≢∪)x finds the number of unique items.

2 f/ x , pairwise reduction, should be as fast as clumsier circumlocutions.

Mix is useful in itself, and in addition underlies the general cases of key and rank.

⊣/ and ⊢/ retrieve the first and the last column. We’d noticed that they were slow in August last year, and a recent message from certain Italians prompted us to do something about them.

The reshape of a scalar can be sped up using a straightforward trick.



10. Weird Stuff

f/x   (⊃⌽f\x)
  f: ⍱ ⍲ < > ≤ ≥
 
-\x   (+\x×(⍴x)⍴1 ¯1)
÷\x   (×\x*(⍴x)⍴1 ¯1)


   x← * 2 7 1 8 2 8
   (-\x) - +\x×(⍴x)⍴1 ¯1
0 0 0 0 0 ¯9.09495e¯13
   (÷\x) - ×\x*(⍴x)⍴1 ¯1
0 0 0 ¯8.47033e¯22 ¯6.77626e¯21 ¯3.30872e¯24

These speed-ups are proposed not because they are important, but for the following reasons:

•   Some published benchmarks [8, 9] include these expressions.
•   We want to establish the precedent that primitives are permitted to give slightly different (but still mathematically correct) results from one version to the next.



11. Talk to Us!

• Tell us if an expression is slower than you expect

• Tell us which expressions are important to your app

• Send us APLMON snapshots



b ⊂[⎕io] x 1.3-42
b[i] 5.2
b[i]←0, b[i]←1 1.2-3.5
x∧.=v 7
⎕dr x 1-3.8
(≢∪)x 2.8-
2 f/ x 2
↑x 4-10
⊢/x, ⊣/x 3-28
8⌶   (version 14.0) 30-60

The speed-ups listed here are a direct result of customers talking to us. So: Talk to us. Send us APLMON snapshots. Help us help you.



References

[0]   Hui, Roger, Primitive Performance, New Constructs: Tally with Key I, Dyalog ’13, Deerfield Beach, FL, 2013-10-22.
[1] Intel, Intel 64 and IA-32 Architectures Developer’ Manual: Volume 2A and Volume 2B, 2014-09.
[2] Hui, Roger, Exploring Index-Of, Dyalog ’14, Eastbourne, UK, 2014-09-21.
[3] Hui, Roger, Rank & Friends, Dyalog ’13, Deerfield Beach, FL, 2013-10-22.
[4] Dyalog, Dyalog APL Release Notes Version 14.0, 2014-08-30.
[5] Hui, Roger, Index-Of, A 30-Year Quest, J Conference 2014, 2014-07-25.
[6] Smith, Bob, A Programming Technique for Non-Rectangular Data, APL79, APL Quote Quad, Volume 9, Number 4, 1979-06.
[7]   Ying, Eugene, APL Optimization Techniques for Commercial Applications, Dyalog User Conference, 2013, Slides 29 and 30.
[8]   Herminghaus, Ralf, APL-Benchmark oder Wie schnell ist APL wirklich?, APL Germany User Meeting, 2012.
[9]   Herminghaus, Ralf, APL-Benchmark (Vergleich 2012-2014) oder Wie schnell ist APL denn heute?, APL Germany User Meeting, 2014-05-13.


Presented at Dyalog ’14, 2014-09-22.

created:  2014-05-17 13:15
updated:2014-09-22 14:50