Performance Tuning 2018

Introduction

These case studies in performance tuning are collected from various public sources. The statements of some of them have been lightly edited.
      ran←{(0,+\⍺÷+/⍺)⍸?⍵⍴0}
      a←7 5 6 4 7 2
      t←a ran 1e6
      +/ (⍳≢a)∘.=t
226234 161255 193030 129190 225895 64396
      1e6 × a÷+/a
225806 161290 193548 129032 225806 64516.1

The technique can be used to generate random numbers according to a probability distribution. For a discrete distribution with values v and corresponding probabilities p, the expression is v[p ran ⍵]. For a continuous distribution, convert it into a discrete one by
dividing (¯∞,∞)
into an appropriate number of intervals. The interval mid points are the values;
the probabilities are the differences of the cumulative distribution function at
the interval end points.
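
For example, a sketch with made-up data (the values v and probabilities p below are hypothetical):

      v←10 20 30 40                  ⍝ hypothetical values
      p←0.1 0.2 0.3 0.4              ⍝ corresponding probabilities
      t←v[p ran 1e6]                 ⍝ one million samples from the distribution
      (+/v∘.=t)÷1e6                  ⍝ empirical frequencies; approximately 0.1 0.2 0.3 0.4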
“Improved” versions of MCL are as follows:
MCL3←{
    ⍺←2                               ⍝ inflation coefficient default
    id←{(⍵,⍵)⍴1,⍵⍴0}                  ⍝ identity matrix of order ⍵
    NormColumn←{⍵÷⍤1+⌿⍵}
    Markov←{⍺ *⍨ +.×⍨ NormColumn ⍵}   ⍝ one Markov iteration
    FP←⍺ Markov⍣≡NormColumn ⍵+id≢⍵
    ⌽↑⍸(0<+/FP)⌿FP
}

Since NormColumn is idempotent and is the first function to be applied in Markov, it is not necessary to apply it before launching the iteration. Whence NormColumn is used just once and can be inserted in-line. Therefore:

MCL4 ← {⍺←2 ⋄ ⌽↑⍸(0<+/FP)⌿FP← ⍺ {⍺ *⍨ +.×⍨ ⍵÷⍤1+⌿⍵}⍣≡ ⍵+∘.=⍨⍳≢⍵}

transpose2 ← {m←↓⍉↑1⊣¨¨⍵ ⋄ m/¨↓⍉↑⍵}   ⍝ John Scholes
transpose3 ← {(∊⍳∘≢¨⍵) {⊂⍵}⌸ ∊⍵}
transpose4 ← {((⍳+/n)-n/+\¯1↓0,n←≢¨⍵) {⊂⍵}⌸ ∊⍵}

I had some trouble with this one. I worked on this in May and then revisited it the week before Dyalog ’18, and it was only then that I realized what it was doing and came up with the problem title. (That’s the key part of the solution, right there.) The following are the intermediate solutions.

In 2018-05: hsu is direct from the problem statement. hsu1 is the initial alternative faster solution, produced without a full understanding of the problem. hsu2 produces a non-matching result using an alternative representation. hsu3 and hsu4 (not shown) are for an unrelated problem.

In 2018-10: hsu5 is also non-matching, but can easily be made to match. I did not bother to fix it because at that point I finally understood what the problem was, and produced hsu6 and the modification of the problem statement shown in the square brackets above.

hsu←{
    m←⍺∘.=⍵
    p←⍺⌿⍨+/m
    c←(≢⍵)|⍸,m
    p c
}

hsu1←{
    u←∪⍵
    d←{⊂⍵}⌸⍵
    k←≢¨d
    i←u⍳⍺
    b←(≢u)>i
    p←k[b/i]/b/⍺
    c←∊d[b/i]
    p c
}

hsu2←{(({⊂⍵}⌸⍵),⊂⍬)[(∪⍵)⍳⍺]}

hsu5←{
    y←⍵[i←⍋⍵]
    b←y≠¯1⌽y ⋄ b[⍳×≢b]←1
    m←(≢¨⊂)⍨b
    (m/b/y)i
}

hsu6←{(⊂p∊⍺)/¨p c ⊣ p←⍵[c←⍋⍵]}

Produce some sample data and test which functions produce matching results:

      a←a[⍋a←95?100]
      b←?1e6⍴100
      ∘.{⍎'a (hsu',⍺,' ≡ hsu',⍵,')b'}⍨ ' 1256'
1 1 0 0 1
1 1 0 0 1
0 0 1 0 0
0 0 0 1 0
1 1 0 0 1

Run some benchmarks. My own preference is for hsu6 even if it’s a bit slower, because it is so much simpler and clearer.

      cmpx 'a hsu'∘,¨' 1256',¨⊂' b'
  a hsu b  → 2.85E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a hsu1 b → 6.38E¯3 | -78% ⎕⎕⎕⎕⎕⎕⎕ *
  a hsu2 b → 5.29E¯3 | -82% ⎕⎕⎕⎕⎕⎕ *
  a hsu5 b → 5.05E¯3 | -83% ⎕⎕⎕⎕⎕
  a hsu6 b → 5.88E¯3 | -80% ⎕⎕⎕⎕⎕⎕

adj1←{
    u←(∪⍵,⍺)~⊂''                 ⍝ the list of unique nodes
    B←(2⍴≢u)⍴0                   ⍝ adjacency matrix
    B[↓u⍳(~⍺∊⊂'')⌿⍺,⍪⍵]←1        ⍝ populate adjacency matrix
    C←{⍵∨∨.∧⍨⍵}⍣≡B               ⍝ transitive closure
    ⍸¨↓⍉C                        ⍝ list of lists of ancestors
}

adj2←{
    u←(∪⍵,⍺)~⊂''                 ⍝ the list of unique nodes
    f←(≢u)⍴⊂⍬                    ⍝ list of fathers
    b←~⍺∊⊂''
    f[u⍳b/⍵],←u⍳b/⍺              ⍝ populate list of fathers
    {∪¨⍵,¨⍵∘{∊⍺[⍵]}¨⍵}⍣≡f        ⍝ transitive closure
}

From A History of APL in 50 Functions, §46:

sieve←{
    b←⍵⍴{∧⌿↑(×/⍵)⍴¨~⍵↑¨1}q←2 3 5
    b[⍳6⌊⍵]←(6⌊⍵)⍴0 0 1 1 0 1
    49≥⍵:b
    p←(≢q)↓⍸∇⌈⍵*0.5
    m←1+⌊(⍵-1+p×p)÷2×p
    b ⊣ p {b[⍺×⍺+2×⍳⍵]←0}¨ m
}

      sieve 20
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1
      ⍸ sieve 20
2 3 5 7 11 13 17 19
      +/ sieve 1e9
50847534
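
As the result of ⍸ sieve 20 above implies, sieve assumes ⎕IO←0. As a sketch of a sanity check, its result can be compared for small arguments against a naive prime generator; primes below is a hypothetical helper written only for this check:

      primes←{n⌿⍨2=+⌿0=n∘.|n←⍳⍵}     ⍝ naive: numbers below ⍵ with exactly two divisors
      (primes 1000) ≡ ⍸ sieve 1000   ⍝ should be 1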
pix0 and pix1 are from the FinnAPL Idiom Library. pix2 and pix3 are slightly faster versions of same.

pix0←{((⍋⍺⍳⍺,⍵)⍳⍳≢⍺)⍳(⍋⍺⍳⍵,⍺)⍳⍳≢⍵}
pix1←{((⍴⍺)⍴⍋⍋⍺⍳⍺,⍵)⍳(⍴⍵)⍴⍋⍋⍺⍳⍵,⍺}
pix2←{n←≢⍺ ⋄ i←⍺⍳⍺,⍵ ⋄ ((⍋i)⍳⍳≢⍺)⍳(⍋n⌽i)⍳⍳≢⍵}
pix3←{n←≢⍺ ⋄ i←⍺⍳⍺,⍵ ⋄ ((⍴⍺)⍴⍋⍋i)⍳(⍴⍵)⍴⍋⍋n⌽i}

      u←?1e4⍴2e9
      x←u[?1e5⍴≢u]
      y←u[?1e5⍴≢u]
      cmpx 'x pix0 y' 'x pix1 y' 'x pix2 y' 'x pix3 y'
  x pix0 y → 9.86E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix1 y → 9.51E¯3 |  -4% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix2 y → 8.45E¯3 | -15% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix3 y → 8.06E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In terms of speed, all of these benefit from the special code in the Dyalog APL interpreter for small-range data in ⍺⍳⍵ and ⍋⍵ and for permutations in ⍋⍵. (In ⍋⍋⍵ the argument of the leftmost ⍋ is a permutation.) The following benchmarks illustrate the point:

      x←?1e6⍴1e6
      y←x ⋄ y[999999]←1e10     ⍝ x is small-range; y is not
      cmpx 'x⍳x' 'y⍳y'
  x⍳x → 9.31E¯3 |     0% ⎕⎕
  y⍳y → 1.29E¯1 | +1287% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx '⍋x' '⍋y'
  ⍋x → 5.11E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ *
  ⍋y → 7.89E¯2 | +54% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

      p←1e6?1e6                ⍝ p is a permutation
      q←p ⋄ q[999999]←⊃q       ⍝ q is not; both are small-range
      cmpx '⍋p' '⍋q'
  ⍋p → 5.81E¯3 |    0% ⎕⎕⎕ *
  ⍋q → 5.74E¯2 | +887% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

7. Sorting a Vector of Strings

In contrast:

      cmpx '⍋↑x'
7.18E¯3
      cmpx '{⍵[⍋↑⍵]}x'
7.56E¯3

The interpreter should detect the case where the argument consists of strings
with roughly similar lengths, i.e. does not have a few really long strings
amidst many more short strings,
and do the ↑⍵ underneath the covers.
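
A sketch of the comparison being made, with hypothetical test data (the x used in the timings above is not shown); presumably ⍋x on the nested vector is the slower baseline being contrasted with:

      x←↓⎕A[?1e5 8⍴26]               ⍝ 100,000 random 8-character strings
      cmpx '⍋x' '⍋↑x' '{⍵[⍋↑⍵]}x'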
8. Without on Vectors of Strings

The computation can be sped up by working with ↑A and ↑B. See also problem #7.

      A←(↓⎕a[?(n,8)⍴26]),¨⍕¨?n⍴1e9 1e4 ⊣ n←100
      B←(↓⎕a[?(n,8)⍴26]),¨⍕¨?n⍴1e9 1e4 ⊣ n←15000
      f←{⍺⌿⍨(≢⍵)≤(≢⍵)(↑⍳↓)↑⍵,⍺}
      cmpx 'A~B' 'A f B'
  A~B   → 1.79E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  A f B → 5.49E¯4 | -70% ⎕⎕⎕⎕⎕⎕⎕⎕⎕
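
A sketch of how f works, with the steps spelled out: g below is a hypothetical expanded version of f. Like f, it appears to assume ⎕IO←0, so that an index of ≢⍵ or more means "not found among the rows of ↑B":

g←{
    M←↑⍵,⍺                 ⍝ pad B and A into one character matrix
    i←((≢⍵)↑M)⍳(≢⍵)↓M      ⍝ look up each row of ↑A among the rows of ↑B
    ⍺⌿⍨(≢⍵)≤i              ⍝ keep the elements of A not found in B
}

      (A~B) ≡ A g B        ⍝ should be 1, like A f B above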
See Hui, Roger, Inverted Tables, Dyalog ’18.

Presumably the question (the “complaint”) is why n{+⌿⍵}⌸m is slower, rather than why (n⍳n){+⌿⍵}⌸m is faster. The short answer is that the current implementation has not yet exploited an opportunity for faster code on small arguments.

You might think that ⍺{+⌿⍵}⌸⍵ should run at the same speed as (⍺⍳⍺){+⌿⍵}⌸⍵, because by definition key is based on ⍺⍳⍺. In general that is true: the interpreter does ⍺⍳⍺ internally. But if the major cells of ⍺ are 1 or 2 bytes wide, the interpreter works with ⍺ directly, without bothering to compute ⍺⍳⍺. In such cases the interpreter creates a table with 256 entries (1-byte ⍺) or 65536 entries (2-byte ⍺). For the argument n in question, each major cell is 2 bytes, so the table has 65536 entries, whereas the major cells of n⍳n are 1 byte and the table has 256 entries. The cost comparison is, roughly, work proportional to 65536 for the larger table versus the cost of computing n⍳n plus work proportional to 256 for the smaller one.
Since n has only 10 rows, the second approach wins big. On my machine:

      cmpx 'n{+⌿⍵}⌸m' '(n⍳n){+⌿⍵}⌸m'
  n{+⌿⍵}⌸m     → 1.48E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (n⍳n){+⌿⍵}⌸m → 1.21E¯6 | -92% ⎕⎕

The following benchmarks show that the comparison becomes more even as the number of rows becomes larger, eventually favoring the shorter solution:

      expr←'y{+⌿⍵}⌸x' '(y⍳y){+⌿⍵}⌸x'
      cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e2
  y{+⌿⍵}⌸x     → 1.53E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 1.86E¯6 | -88% ⎕⎕⎕⎕
      cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e3
  y{+⌿⍵}⌸x     → 1.91E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 8.79E¯6 | -54% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e4
  y{+⌿⍵}⌸x     → 6.22E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 7.32E¯5 | +17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e5
  y{+⌿⍵}⌸x     → 5.01E¯4 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 7.36E¯4 | +46% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Up to now, we have not done much optimization for small arguments,
because the times involved are small and therefore the pay-off wouldn’t be big.
Optimizations on expressions taking a small amount of time are also hard to do.
After the Dyalog ’18 conference rush I will think about this specific case.
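
In the meantime, the equivalence that the workaround relies on can be sketched with made-up data; k and d below are hypothetical stand-ins for n and m:

      k←?10 1⍴5                      ⍝ hypothetical key matrix, 10 rows
      d←?10 4⍴0                      ⍝ hypothetical data matrix, 10 rows
      (k{+⌿⍵}⌸d) ≡ (k⍳k){+⌿⍵}⌸d      ⍝ should be 1: key is defined in terms of ⍺⍳⍺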
See the blog post Diane’s Lasagne Problem.