Performance Tuning 2018
Case Studies




Introduction

These case studies in performance tuning are collected from various public sources. The statement of some of them have been lightly edited.
 

0. Random Numbers

   ran←{(0,+\⍺÷+/⍺)⍸?⍵⍴0}

   a← 7 5 6 4 7 2
   t←a ran 1e6

   +/ (⍳≢a)∘.=t
226234 161255 193030 129190 225895 64396
   1e6 × a÷+/a
225806 161290 193548 129032 225806 64516.1

The technique can be used to generate random numbers according to a probability distribution.

If a discrete distribution with values v and corresponding probabilities p , then v[p ran ⍵] .

If a continuous distribution, convert it into a discrete one by dividing (¯∞,∞) into an appropriate number of intervals. The interval mid points are the values; the probabilities are the differences of the cumulative distribution function at the interval end points.
 

1. Markov Cluster

“Improved” versions of as follows:
•  separate code to create the matrix and to do the Markov iteration
trivial functions included in-line and left unnamed

MCL3←{
  ⍺←2                       ⍝ inflation coefficient default
  id←{(⍵,⍵)⍴1,⍵⍴0}          ⍝ identity matrix of order ⍵   
  NormColumn←{⍵÷⍤1+⌿⍵}
  Markov←{⍺ *⍨ +.×⍨ NormColumn ⍵}  ⍝ one Markov iteration
  FP←⍺ Markov⍣≡NormColumn ⍵+id≢⍵
  ⌽↑⍸(0<+/FP)⌿FP
}

Since NormColumn is idempotent and is the first function to be applied in Markov, it is not necessary to apply it before launching the iteration. Whence NormColumn is used just once and can be inserted in-line. Therefore:

MCL4 ← {⍺←2 ⋄ ⌽↑⍸(0<+/FP)⌿FP← 
           ⍺ {⍺ *⍨ +.×⍨ ⍵÷⍤1+⌿⍵}⍣≡ ⍵+∘.=⍨⍳≢⍵}

2. Transpose Ragged Array

transpose2 ← {m←↓⍉↑1⊣¨¨⍵ ⋄ m/¨↓⍉↑⍵}  ⍝ John Scholes
transpose3 ← {(∊⍳∘≢¨⍵) {⊂⍵}⌸ ∊⍵}
transpose4 ← {((⍳+/n)-n/+\¯1↓0,n←≢¨⍵) {⊂⍵}⌸ ∊⍵}

3. Indices of Occurrences

I had some trouble with this one. I worked on this in May and then revisited it the week before Dyalog ’18, and it was only then that I realized what it was doing and came up with the problem title. (That’s the key part of the solution, right there.)

The following are the intermediate solutions. In 2018-05: hsu is direct from the problem statement. hsu1 is the initial alternative faster solution, produced without a full understanding of the problem. hsu2 produces a non-matching result using an alternative representation. hsu3 and hsu4 (not shown) are for an unrelated problem. In 2018-10: hsu5 is also non-matching, but can easily be made to match. I did not bother to fix it because at that point I finally understood what the problem was, and produced hsu6 and the modification of the problem statement shown in the square brackets above.

hsu←{
  m←⍺∘.=⍵
  p←⍺⌿⍨+/m
  c←(≢⍵)|⍸,m
  p c
}

hsu1←{
  u←∪⍵
  d←{⊂⍵}⌸⍵
  k←≢¨d
  i←u⍳⍺
  b←(≢u)>i
  p←k[b/i]/b/⍺
  c←∊d[b/i]
  p c
}

hsu2←{(({⊂⍵}⌸⍵),⊂⍬)[(∪⍵)⍳⍺]}

hsu5←{
  y←⍵[i←⍋⍵]
  b←y≠¯1⌽y ⋄ b[⍳×≢b]←1
  m←(≢¨⊂)⍨b
  (m/b/y)i
}

hsu6←{(⊂p∊⍺)/¨p c ⊣ p←⍵[c←⍋⍵]}

Produce some sample data and test which functions produce matching results:

   a←a[⍋a←95?100]
   b←?1e6⍴100

   ∘.{⍎'a (hsu',⍺,' ≡ hsu',⍵,')b'}⍨ ' 1256'
1 1 0 0 1
1 1 0 0 1
0 0 1 0 0
0 0 0 1 0
1 1 0 0 1

Run some benchmarks. My own preference is for hsu6 even if it’s a bit slower, because it is so much simpler and clearer.

   cmpx 'a hsu'∘,¨' 1256',¨⊂' b'
  a hsu  b → 2.85E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  a hsu1 b → 6.38E¯3 | -78% ⎕⎕⎕⎕⎕⎕⎕
* a hsu2 b → 5.29E¯3 | -82% ⎕⎕⎕⎕⎕⎕
* a hsu5 b → 5.05E¯3 | -83% ⎕⎕⎕⎕⎕
  a hsu6 b → 5.88E¯3 | -80% ⎕⎕⎕⎕⎕⎕

4. Ancestors/Descendants

adj1←{
  u←(∪⍵,⍺)~⊂''           ⍝ the list of unique nodes 
  B←(2⍴≢u)⍴0             ⍝ adjacency matrix
  B[↓u⍳(~⍺∊⊂'')⌿⍺,⍪⍵]←1  ⍝ populate adjacency matrix
  C←{⍵∨∨.∧⍨⍵}⍣≡B         ⍝ transitive closure
  ⍸¨↓⍉C                  ⍝ list of lists of ancestors
}

adj2←{
  u←(∪⍵,⍺)~⊂''           ⍝ the list of unique nodes
  f←(≢u)⍴⊂⍬              ⍝ list of fathers
  b←~⍺∊⊂''
  f[u⍳b/k],←u⍳b/⍺        ⍝ populate list of fathers
  {∪¨⍵,¨⍵∘{∊⍺[⍵]}¨⍵}⍣≡f  ⍝ transitive closure
}

5. Prime Sieve

From A History of APL in 50 Functions, §46:

sieve←{
  b←⍵⍴{∧⌿↑(×/⍵)⍴¨~⍵↑¨1}q←2 3 5
  b[⍳6⌊⍵]←(6⌊⍵)⍴0 0 1 1 0 1 
  49≥⍵:b
  p←(≢q)↓⍸∇⌈⍵*0.5
  m←1+⌊(⍵-1+p×p)÷2×p
  b ⊣ p {b[⍺×⍺+2×⍳⍵]←0}¨ m
}

   sieve 20
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1

   ⍸ sieve 20
2 3 5 7 11 13 17 19

   +/ sieve 1e9
50847534

6. Progressive Index-Of

pix0 and pix1 are from the FinnAPL Idiom Library. pix2 and pix3 are slightly faster versions of same.

pix0←{((⍋⍺⍳⍺,⍵)⍳⍳≢⍺)⍳(⍋⍺⍳⍵,⍺)⍳⍳≢⍵}
pix1←{((⍴⍺)⍴⍋⍋⍺⍳⍺,⍵)⍳(⍴⍵)⍴⍋⍋⍺⍳⍵,⍺}
pix2←{n←≢⍺ ⋄ i←⍺⍳⍺,⍵ ⋄ ((⍋i)⍳⍳≢⍺)⍳(⍋n⌽i)⍳⍳≢⍵}
pix3←{n←≢⍺ ⋄ i←⍺⍳⍺,⍵ ⋄ ((⍴⍺)⍴⍋⍋i)⍳(⍴⍵)⍴⍋⍋n⌽i}

   u←?1e4⍴2e9
   x←u[?1e5⍴≢u]
   y←u[?1e5⍴≢u]

   cmpx 'x pix0 y' 'x pix1 y' 'x pix2 y' 'x pix3 y'
  x pix0 y → 9.86E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix1 y → 9.51E¯3 |  -4% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix2 y → 8.45E¯3 | -15% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x pix3 y → 8.06E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In terms of speed, all of these benefit from the special code in the Dyalog APL interpreter for small-range data in ⍺⍳⍵ and ⍋⍵ and for permutations in ⍋⍵ . (In ⍋⍋⍵ the argument of the leftmostis a permutation.) The following benchmarks illustrate the point:

   x←?1e6⍴1e6
   y←x ⋄ y[999999]←1e10  ⍝ x is small-range; y is not

   cmpx 'x⍳x' 'y⍳y'
  x⍳x → 9.31E¯3 |     0% ⎕⎕
  y⍳y → 1.29E¯1 | +1287% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   cmpx '⍋x' '⍋y'
  ⍋x → 5.11E¯2 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* ⍋y → 7.89E¯2 | +54% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   p←1e6?1e6             ⍝ p is a permutation
   q←p ⋄ q[999999]←⊃q    ⍝ q is not; both are small-range

   cmpx '⍋p' '⍋q'
  ⍋p → 5.81E¯3 |    0% ⎕⎕⎕
* ⍋q → 5.74E¯2 | +887% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

7. Sorting a Vector of Strings

In contrast:

   cmpx '⍋↑x'
7.18E¯3
   cmpx '{⍵[⍋↑⍵]}x'
7.56E¯3

The interpreter should detect the case where the argument consists of strings with roughly similar lengths, i.e. does not have a few really long strings amidst many more short strings, and do the ↑⍵ underneath the covers.
 

8. Without on Vectors of Strings

The computation can be sped up by working with ↑A and ↑B . See also problem #7.

   A←(↓⎕a[?(n,8)⍴26]),¨⍕¨?n⍴1e9 1e4 ⊣ n←100
   B←(↓⎕a[?(n,8)⍴26]),¨⍕¨?n⍴1e9 1e4 ⊣ n←15000

   f←{⍺⌿⍨(≢⍵)≤(≢⍵)(↑⍳↓)↑⍵,⍺}

   cmpx 'A~B' 'A f B'
  A~B   → 1.79E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  A f B → 5.49E¯4 | -70% ⎕⎕⎕⎕⎕⎕⎕⎕⎕

9. Inverted Tables

See Hui, Roger, Inverted Tables, Dyalog ’18, 2018-11-01.
 

10. Key Operator

Presumably the question (the “complaint”) is why n{+⌿⍵}⌸m is slower rather than why (n⍳n){+⌿⍵}⌸m is faster. The short answer is that the current implementation has not yet exploited an opportunity for faster code on small arguments.

You might think that ⍺{+⌿⍵}⌸⍵ should run at the same speed (⍺⍳⍺){+⌿⍵}⌸⍵ , because by definition key is based on ⍺⍳⍺ . In general that is true, the interpreter does ⍺⍳⍺ internally, but if the major cells ofare 1 or 2 bytes, the interpreter works withdirectly, without bothering to compute ⍺⍳⍺ . In such cases the interpreter creates a table with 256 entries (1-byte) or 65536 entries (2-byte).

For the argument n in question, each major cell is 2-bytes, so the table has 65536 entries, whereas major cells of n⍳n are 1-byte and the table has 256 entries. The cost comparison is

• working with a table with 65536 entries, versus
doing n⍳n plus working with a table with 256 entries.

Since n has only 10 rows, the second approach wins big. On my machine:

   cmpx 'n{+⌿⍵}⌸m' '(n⍳n){+⌿⍵}⌸m'
  n{+⌿⍵}⌸m     → 1.48E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (n⍳n){+⌿⍵}⌸m → 1.21E¯6 | -92% ⎕⎕

The following benchmarks show that the comparison becomes more even as the number of rows become larger, eventually favoring the shorter solution:

   expr←'y{+⌿⍵}⌸x' '(y⍳y){+⌿⍵}⌸x'

   cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e2
  y{+⌿⍵}⌸x     → 1.53E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 1.86E¯6 | -88% ⎕⎕⎕⎕

   cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e3
  y{+⌿⍵}⌸x     → 1.91E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 8.79E¯6 | -54% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e4
  y{+⌿⍵}⌸x     → 6.22E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 7.32E¯5 | +17% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   cmpx expr ⊣ y←n[?q⍴10;] ⊣ x←m[?q⍴10;] ⊣ q←1e5
  y{+⌿⍵}⌸x     → 5.01E¯4 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (y⍳y){+⌿⍵}⌸x → 7.36E¯4 | +46% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Up to now, we have not done much optimization for small arguments, because the times involved are small and therefore the pay-off wouldn’t be big. Optimizations on expressions taking a small amount of time are also hard to do. After the Dyalog ’18 conference rush I will think about this specific case.
 

11. Diane’s Lasagne Problem

See the blog post Diane’s Lasagne Problem, 2018-11-27.
 


created:  2018-07-05 18:55
updated: 2018-12-06 14:35