Performance
Caveat: Factors specified on this page are obtained from micro-benchmarks performed on specific primitive functions; in real applications factors will depend on a mix of primitives.
All benchmark tests were performed on 64-bit interpreters on Linux/Microsoft Windows operating systems.
Internal Benchmarks
Internal benchmarking was performed on the initial release of Dyalog version 18.0 and the results compared with the initial release of Dyalog version 17.0.
The benchmarking process comprises over 13,000 benchmarks in more than 130 groups; the group geometric mean timing ratios are measured and plotted against the groups sorted by their means. The vertical axis of the graph shows the ratios as a percentage change; negative values are shown in blue and indicate a performance enhancement, and positive values are shown in red and indicate a deterioration in performance.
Results showed that core interpreter performance in Dyalog version 18.0 has an average improvement of 9.7% over Dyalog version 17.0.
Areas of Focus
With two full years of performance work, Dyalog version 18.0 is a lot faster than version 17.0. Many changes are not shown in the internal benchmarking graph above as they affect cases that are not included in our benchmarks, such as high-rank arrays, arguments with different element types, or combinations of functions. Changes at version 18.0 include:
- Better cache usage – A tweak to our memory manager allows it to use space from freed pockets again. Since this memory is usually in the processor's cache, this raises the performance ceiling on the fastest operations; simple arithmetic and functions that move large blocks of data are now 3 to 4 times faster.
- Less interpreter overhead – Various primitives have been incrementally improved and the cost of invoking a dfn or operator has been reduced.
- Sets, searches and sorts – Although other functions have also been improved, the largest improvements are to the set functions dyadic ⍳ ∊ ∩ ∪ ~ and monadic ∪ (the new function unique mask (monadic ≠) benefits from the fast algorithms developed for unique as well), interval index (dyadic ⍸), and sorting (idioms {⍵[⍋⍵]} and so on).
- Special cases – The coverage of many special cases has been broadened in an effort to make "special" the normal state of affairs when working with flat arrays. A few special combinations have also been added to make previously-unavailable algorithms accessible in APL.
Some specific examples of improvements follow. Whilst these can give an idea of things to look for in Dyalog veresion 18.0, we recommend that you benchmark your own data; our estimated performance improvements were obtained using an Intel Kaby Lake i7-7500U but searching and sorting functions have performance that depends on their arguments in complicated ways and it's not possible to give a complete account of their speed.
Search functions
Significant work has been done to improve the search functions index-of (dyadic ⍳), membership (dyadic ∊), and interval index (dyadic ⍸), with several new methods developed and some old ones sped up. Some of these methods are described in more detail in the Dyalog '18 presentation Sub-nanosecond Searches Using Vector Instructions.
- Searches with a small principal argument (the left argument of ⍳ or ⍸ and the right of ∊) have been greatly improved when the SSSE3 instruction set (introduced to x86 in 2006) is available. With a principal argument of 10 4-byte elements, index-of is 5.5x faster, membership is 10.8x faster, and interval index is 9.2x faster.
- Membership on 1-byte elements has also been sped up considerably when SSSE3 is present. It is between 3x and 6x faster (sometimes more) in all cases when at least one argument is sufficiently long.
- Searching 4-byte cells is much faster: index-of with a 1000-element left argument and a much longer right argument is 3.2x faster when all elements are found, 6.4x faster when no elements are found, and 2.0x faster with 50% overlap. The numbers for membership (arguments reversed) are 8.7x, 14.8x, and 14.7x; here the performance of membership no longer depends on the amount of overlap.
- Tolerant hashing has been sped up: index-of on two thousand-element floating-point vectors with 50% overlap is 33% faster, and on complex vectors it's 24% faster.
Reverse Searching
Traditionally, search functions in APL process the principal argument – the argument that the user wants to search through – to create some kind of search structure and then use this structure to obtain a result for each element of the non-principal argument. Reverse searching is the general strategy of processing the non-principal argument instead, inspecting elements from the principal argument one at a time until all elements searched for have been located.
This strategy can make looking for a few elements in a large array much faster. The following timings are taken with a principal argument of 100,000 elements and a non-principal argument of 100 elements. The different kinds of reverse search are highlighted; unhighlighted spaces do not yet use reverse searching.
⍳ | ∊ | ⍸ | |
---|---|---|---|
1-byte int | 0.9x | 5.3x | 72x |
2-byte int | 1.1x | 1.4x | 41x |
4-byte int | 11.2x | 11.0x | 627x |
8-byte floating point | 2.0x | 2.0x | 823x |
Dyadic Set Functions
The set functions union (dyadic ∪), intersection (dyadic ∩), and without (dyadic ~) are now always computed by applying membership to the arguments in the appropriate order and using the Boolean result as a filter. As both membership and compress (dyadic / with a Boolean left argument) are very heavily optimised, this is almost always an improvement. For example, taking the union of two 1000-element vectors of 2-byte elements with 50% overlap is 2.0x faster.
The Unique Family
The functions unique (∪), unique mask (≠), and index-of (⍳) when both arguments are references to the same array, along with part of the key operator (⌸), have now been unified and use the same methods when possible. They recognise the same trivial cases, and treat flat arrays independently of their types when they are compared intolerantly. Unless the argument is mixed or nested, or floating point with ⎕CT not equal to 0, only the number of bits in a major cell determines which algorithms are available. The available algorithms have been made more performant as well – here are improvements measured on 100,000-element random vectors:
∪ | ⍳⍨ | |
---|---|---|
1-byte int | 0.9x | 5.3x |
2-byte int | 1.1x | 1.4x |
4-byte int | 11.2x | 11.0x |
8-byte floating point | 2.0x | 2.0x |
Sorting
To benefit from improvements to sorting in Dyalog version 18.0, you must be using one of the following sorting formats:
- {⍵[⍋⍵]}
- {⍵[⍒⍵]}
- {⍵[⍋⍵;]}
- {⍵[⍒⍵;]}
- {(⊂⍋⍵)⌷⍵}
- {(⊂⍒⍵)⌷⍵}
There is now dedicated sorting code for vectors of any type, as well as for arrays where the number of bits in a cell is a power of two (up to 64 bits).
Arrays with major cells of 1, 2, 4 or 8 bytes are sorted with our own version of pdqsort, hybridised with a small-range counting sort that is used when a portion of the array has a range at most 16 times its length (putting some strain on the term "small-range"). For random full-range arrays, the new sort is generally between 1x and 5x faster, with the largest improvements around length 10,000 and the smallest at lengths of a million or more. The internal benchmarks used in the graph at the top of this page showed an average 46% improvement in sorting vectors, making it the most improved group of benchmarks.
Simple vectors of complex numbers and decimal floats, as well as mixed or nested vectors, are now sorted using Timsort. Timsort is 20% to 2x faster even on random arrays, and can be hundreds of times faster for arrays with some pre-existing structure. Timsort is also now used for grade (⍋ or ⍒), resulting in similar but slightly smaller speedups.
Miscellaneous
- Splitting a string on spaces using (' '≠str)⊆str or ' '(≠⊆⊢)str is 2.0x faster when 1 in 5 characters are spaces (due to partition having been sped up).
- Using indexing to permute the columns of a 10-column matrix of 1-byte elements is 5.7x faster.
- Taking the first 3 columns of a 12-column Boolean matrix is 8.9x faster.
- Rotating the rows of a 23-column Boolean matrix by a constant amount is 8.8x faster.
- -\ of a thousand-element floating-point vector is 78x faster.
- Dyalog can now use carry-less multiply (available in x86 since 2010) for xor-reductions. This means that ≠\V on a Boolean vector V is 2.0x faster, while 25≠/V is 5.0x faster. On a matrix A with 7-element rows, ≠/A is 1.6x faster and 5≠⌿A is 6.8x faster. All cases also apply with = in place of ≠.
- Computing the floor or ceiling of a floating-point vector with a 4-byte integer result is 8.5x faster.
- The base-b logarithm of a positive array (where b is a scalar) is 1.9x faster.
- Dividing an array by an arbitrary scalar is 3.0x faster when the CPU has a fused multiply-add instruction (Intel CPUs since 2013, AMD since 2012 and all supported IBM POWER machines). This operation is more accurate than multiplying by the reciprocal and now has nearly the same speed.
The improvements in logarithm and divide apply only to arrays of real numbers and when ⎕FR=645.
Special Cases
The combination disclose-of-grade, which finds the index of the smallest or largest major cell of an array, now has special linear-time code for integer and floating-point vectors, as well as Boolean arrays of any shape. This combination must be written as an atop either in 2-train or operator form, that is, it is (⊃⍋) or ⊃⍤⍋, where ⊃ should be changed to ↑ if ⎕ML≥2, and grade down (⍒) can be used in place of grade up (⍋).
The fastest and most accurate way to sum the array A along the first axis is now 1⊥A. As +⌿A is defined to sum elements in a linear order, it cannot be evaluated quickly on a floating-point argument with small major cells. 1⊥A does not have this requirement, and now adds elements in an order that makes better use of vector instructions and accumulates less floating-point error on typical arrays. For example, summing a random vector ?1e4⍴0 with decode is 4.9x faster and has an average error 26x smaller than when using plus reduce. Similarly, 1(⊥⍤k)A is the fastest way to sum along a particular axis of A.
I(⌷⍤1 99)X is now the fastest way to perform scatter-point indexing, which selects individual elements from an array X and can be written X[↓I]. For example, selecting many elements from a shape 100 100 array of 1-byte elements with ⌷⍤1 99 is now 22x faster than using X[↓I], even though X[↓I] is 10% faster than in Dyalog version 17.1 due to improvements in split.
9 11(○⍤1 0)Y is now an efficient way to convert an array of complex numbers into pairs of real numbers giving the real and imaginary parts (adding a length-2 axes after the last axis). The rank expression (X∘f⍤k)Y is now always converted to X(f⍤((≢⍴X),k))Y, so the expression (9 11∘○⍤0)Y works equally well for this purpose.
Reduce, reduce n-wise, scan, outer product and each applied to an identity primitive function (⊣ or ⊢) or idiom ({⍺} {⍵}) are now all handled specially, as are reduce, reduce n-wise, outer product, each and rank applied to a derived function created from the new constant operator (⍨). For operators with axes, non-barred and barred forms with and without a specified axis are all optimised (previously only a few of these combinations were supported).