The Tuning Pipeline 2014 0. Range 0
At last year’s conference [0] I had a slide illustrating the performance of the new key operator. The only difference between line a and line b was 1000 v 1+⌈/x . I thought about it some more after the conference, and realized that the ratios show that this case of key was really fast. Unbelievable, even. Further analysis led to a novel scalar algorithm for finding the maximum of a vector, faster than the C code shown here; an algorithm that eluded even expert C programmers. I discussed these findings with the Dyalog developers. Jay Foad pointed out that there are vector instructions [1, p. B4-138] for finding the min and the max, much faster than even the novel scalar algorithm. The factors are larger than one’d expect from the degree of parallelism, because the Intel instruction set (and therefore C) do not provide min and max as primitives for scalars. 1. Range 1
The preceding gives us the first set of speed-ups. A factor of 6 for 4-byte integers, 12 for 2-byte integers, and 25 for 1-byte integers. 2. Index-Of: Small v Large Range
If ⌊/ and ⌈/ were the only speed-ups from fast range-finding, then the preceding discussion would not be very significant. However: For the index-of family of functions, for 4-byte integers, small-range arguments have more efficient algorithms [2]. The first column shows how much faster. The second column shows the cost of determining whether the small-range algorithms are applicable. The two columns show that for an expenditure of between 1 to 2% we can potentially get a factor of 2.5 to 5.7 speed-up. Our experience has been that timing differences less than 5% can not be reliably detected. The last two columns show the improvement from 14.0 to 14.1. 3. Index-Of: Inverted Tables
Last year, we also talked about inverted table index-of [3]. This useful function was implemented as 8⌶ in Dyalog version 14.0 [4, p. 80]. For one user, 8⌶ runs 30 to 60 times faster than what’s they were using. The speed of 8⌶ comes from using the CRC32 instruction [1, p. A3-191], from exploiting “selfies”, and from other techniques. 4. Index-Of: Other Improvements
x⍳y is equivalent to c(8⌶)d where c and d are 1-column tables of x and y . We have not yet retrofitted the 8⌶ code back into the ordinary ⍳ code. These benchmarks show the amount of speed-up that can be expected. Note that the second benchmark is a selfie, an expression where the left and right arguments are the same. 5. Key: Special Code
The key operator is new to version 14.0 [4, p. 18], and special code enhances its practical usefulness. Now that we have gotten our head around the fact that the operand function applies to both arguments, we realize that more special cases should be implemented. 6. Partition (Pseudo-) Operator
We sped up partition because it plays a key role in the general case of the key operator. In doing the work, we noticed that the implemented logic does almost all required of a partition operator [6]. The last expression computes the partition lengths. It almost as short as the shortest alternative and now it is faster as well. 7. Boolean
The speed-ups to boolean indexing and boolean index assignment are due in part to the fast range-finding. You may have wondered earlier about finding the min and max of 1- or 2-byte integers. What use is that? Well, fast-ranging finding is just the thing for checking index bounds, and on modern machines separating out the index checking from the main processing, speeds up the main processing. (Cf. branch prediction.) The boolean inner products compose the POPCNT instruction [1, p. B4-189], which counts the number of 1-bits in a word, with the boolean function. 8. Scalar Arithmetic Functions
These speed-ups were proposed by Eugene Ying in past conferences [7]. The last one is ruled out because we don’t want give different results for a function as important as division or multiplication. 9. Miscellaneous
⎕dr squeezes data into a smaller datatype. There are knock-on benefits because algorithms on smaller datatypes are often faster. (≢∪)x finds the number of unique items. 2 f/ x , pairwise reduction, should be as fast as clumsier circumlocutions. Mix is useful in itself, and in addition underlies the general cases of key and rank. ⊣/ and ⊢/ retrieve the first and the last column. We’d noticed that they were slow in August last year, and a recent message from certain Italians prompted us to do something about them. The reshape of a scalar can be sped up using a straightforward trick. 10. Weird Stuff
These speed-ups are proposed not because they are important, but for the following reasons:
11. Talk to Us!
The speed-ups listed here are a direct result of customers talking to us. So: Talk to us. Send us APLMON snapshots. Help us help you. References
Presented at Dyalog ’14, 2014-09-22.
|