Shuffle Faster Please!

Andy reported that in the shuffle QA some functions take a long time:

m9249   “4½ days so far”
rankop  21.5 hours
m12466  26.3 hours
points  7.2 hours

Background: Shuffle QA is an intensification of the normal QA. The suite of over 1800 QA functions is run under shuffle, whereby every getspace (memory allocation) is followed by every APL array in the workspace being shifted (“shuffled”) and by a simple workspace integrity check. The purpose is to uncover memory reads/writes which are rendered unsafe by events which in normal operations are rare.

m9249

m9249 tests +\[a], ×\[a], ⌈\[a], and ⌊\[a]. It ran in under 3 seconds in normal operations; for it to take more than 4½ days under shuffle, getspace is clearly the dominant resource and hereafter we focus on that.

m9249 used expressions like:

assert (+⍀x) ≡ {⍺+⍵}⍀x

Since {⍺+⍵} is not recognized by the scan operator and therefore not supported by special code, {⍺+⍵}⍀x is done by the general O(n*2) algorithm for scans, AND on each scalar the operand {⍺+⍵} is invoked and a getspace is done. The improved, faster QA does this:

F ← {1≥≢⍵:⍵ ⋄ x ⍪ (¯1↑⍵) ⍺⍺⍨ ¯1↑x←⍺⍺ ∇∇ ¯1↓⍵}
assert (+⍀x) ≡ +F x

With such a change, the faster m9249 runs in 22 minutes under shuffle instead of over 4½ days, without any reduction in coverage.

The number of getspace are as follows: +⍀x takes O(1). The second expression {⍺+⍵}⍀x does ×\1↓⍴x separate scans, and each one takes O((≢x)*2) getspace; the total is therefore O(((≢x)*2)××/1↓⍴x) ←→ O((≢x)××/⍴x). The third expression +F is linear in ≢x and is therefore O(≢x) in the number of getspace. In m9249 the largest x has size 33 5 7. The following table shows the orders of complexity and what they imply for this largest x:

+⍀x O(1) 1
{⍺+⍵}⍀x O((≢x)××/⍴x) 38115 = 33 × ×/ 33 5 7
+F x O(≢x) 33

The ratio between 38115 and 33 goes a long way towards explaining why the time to run m9249 under shuffle was over 4½ days and is now 22 minutes. (Why does m9249 still take 22 minutes? It tests for the four functions + × ⌈ ⌊, for various axes, for different datatypes, and for different axis lengths.)

rankop

Another shuffle sluggard rankop took 47 hours. rankop tests expressions of the form f⍤r⊢yand x f⍤r⊢y for various functions f. The techniques used for reducing the number of getspace differ from function to function. We look at x↑⍤r⊢y as an illustration.

assert (4↑⍤1⊢y) ≡ 4{⍺↑⍵}⍤1⊢y

⍴y was 2 7 1 8 2 8. The expression took 7.6e¯5 seconds normally and 7.5 seconds under shuffle. The left hand side requires O(1) getspace, the right hand side O(×/¯1↓⍴x). The expression was changed to:

assert (4↑⍤1⊢y) ≡ 2 7 1 8 2 4↑y

Feels like cheating, doesn’t it? 🙂 The new expression takes 0.115 seconds under shuffle.

m12466

m12466 tests n⌈/[a]x (and n⌊/[a]x). Previously, it did this by comparing n⌈/[a]x against n{⍺⌈⍵}/[a]x for x with shape 7 17 13 11, for each of the four axes, for two different values of n, and for the 1-, 2-, 4-, and 8-byte numeric datatypes. It required 5238426 getspace and 26.3 hours to run under shuffle.

F←{p⍉⍺⍺⌿(⊂(⎕io-⍨⍳|⍺)∘.+⍳(1-|⍺)+(⍴⍵)[⍵⍵])⌷(⍋p←⍒⍵⍵=⍳⍴⍴⍵)⍉⍵}

F is a dyadic operator. The left operand ⍺⍺ is or ; the right operand ⍵⍵ is the axis; and n(⌈ F a)x is the same as n⌈/[a]x. n(⌈ F a)x mainly does n⌈⌿x; other axes are handled by transposing the axis of interest to be the first, then doing n⌈⌿x, then transposing the axes back into the required order. n(⌈ F a)x is much more complicated than n{⍺⌈⍵}/[a]x but has the advantage of using only O(1) getspace.

The revamped m12466 function uses 13717 getspace and 2 minutes to run under shuffle.

points

points is a function to test monadic format () for different values of ⎕pp.

points←{⍺←17 ⋄ ⎕pp←⍺
  tt←{(⌽∨\⌽⍵≠⍺)/⍵}                   ⍝ trim trailing ⍺s.    
  ~0∊∘.{                             ⍝ all pairs (⍳⍵)       
    num←'0'tt↑,/⍕¨⍺'.'⍵              ⍝ eg '9.3'             
    num≡⍕⍎num                        ⍝ check round trip     
  }⍨⍳⍵
}

The expression is 14 15 16 17 points¨ 100, meaning that for each of the 100 numbers num as text,

  1.1   1.2   1.3   1.4   1.5     1.100
  2.1   2.2   2.3   2.4   2.5     2.100
  3.1   3.2   3.3   3.4   3.5 ... 3.100
...
 99.1  99.2  99.3  99.4  99.5    99.100
100.1 100.2 100.3 100.4 100.5   100.100

a “round-trip” num≡⍕⍎num is evaluated. The expression requires 1.3 million getspace and 7.2 hours to run under shuffle.

A much more efficient version with respect to getspace is possible. Let x be a vector. ⍕¨x requires O(≢x) getspace; ⍕x requires O(1) getspace and, like ⍕¨x, formats each individual element of x independently.

points1←{
  ⎕io←⎕ml←1                                                 
  tt←{(⌽∨\⌽⍵≠⍺)/⍵}                   ⍝ trim trailing ⍺s     
  x←1↓∊(' ',¨s,¨'.')∘.,'0'tt¨s←⍕¨⍳⍵  ⍝ numbers as text      
  y←⍎x                                                      
  {⎕pp←⍵ ⋄ x≡⍕y}¨⍺                   ⍝ check round trip     
}

14 15 16 17 points1 100 requires 11050 getspace and less than 2 minutes to run under shuffle.

What to do?

Andy says that the shuffle QA takes over 2000 hours (!). The shuffle sluggards will be tackled as done here by using more parsimonious APL expressions to compare against the aspect being QA-ed. Going forward, a new QA function will be run under shuffle as it is being developed. The developers will be unlikely to tolerate a running time of 4½ days.

The Halting Problem Rendered in APL

The halting problem is the problem of determining, from a description of a program and an input, whether the program will finish running or continue to run forever. It was proved in the negative by Alonzo Church and Alan Turing in 1936, and can be rendered in Dyalog APL as follows:

Suppose there is an operator H such that f H ⍵ is 1 or 0 according to whether f ⍵ halts. Consider:

In dfns

   f←{∇ H ⍵:∇ ⍵ ⋄ ⍬}

For f 0, if ∇ H 0 is 1, then ∇ 0 is invoked, leading to an infinite recursion, therefore ∇ H 0 should have been 0. On the other hand, if ∇ H 0 is 0, then the part is invoked and is the result of f 0, therefore ∇ H 0 should have been 1.

Presented as a table: For f 0

suppose ∇ H 0 is invokes consequence ∇ H 0 should be
1 ∇ 0 infinite recursion 0
0 f 0 results in 1

Therefore, there cannot be such an operator H.

In Tradfns

   ⎕fx 'f x' '→f H x'

For f 0

suppose f H 0 is invokes consequence f H 0 should be
1 →1     infinite loop 0
0 →0 f 0 exits 1

PQA

PQA is an acronym for Performance Quality Assurance. We developed PQA to answer the questions, which APL primitives have slowed down from one build of the Dyalog interpreter to the next, and by how much? Currently, PQA consists of 13,659 benchmarks divided into 136 groups.





4

The Graphs

The graph above (and all other graphs in this article) plot the timing ratios between version 14.1 and the build of version 15.0 on 2016-01-22, the Windows 64-bit Unicode version. The data are the group geometric means. The horizontal axis are the groups sorted by the means. The vertical axis are logarithms of the means. The percent at the top (-17.1%) indicates that over all 136 groups, version 15.0 is 17.1% faster than version 14.1. The bottom of the graph indicates how many of the groups are faster (blue, 69.9%) and how many are more than 2% faster (89/136); and how many are slower (red, 30.1%) and how many are more than 2% slower (27/136).

For the graph above, the amount of blue is gratifying but the red is worrying. From past experience the more blue the better (of course) but any significant red is problematic; speed-ups will not excuse slow-downs.

PQA is run nightly when a new build of version 15.0 is available. Each PQA run takes about 12.5 hours on a dedicated machine. The graphs from the past month’s runs are presented on the strip on the right; the first of them is a shrunken version of the big graph at the beginning of the article.

Challenges

About two years ago, investigations into a user report of an interpreter slow-down underscored for us a “feature” of modern computer systems: An APL primitive can run slower even though the C source is changed only in seemingly inconsequential ways and indeed even unchanged. Moreover, the variability in measurements frequently exceeded the difference in timings between the two builds, and it was very difficult to establish any kind of pattern. It got to the point where even colleagues were doubting us — how hard can it be to run a few benchmarks?

  • The system is very noisy.
  • Cache is king.
  • The CPU acts like an interpreter.
  • Branching (in machine code) is slow.
  • Instruction counts don’t tell all.
  • The number of combinations of APL primitives and argument types is enormous.

What can be done about it?

  • Developed PQA.
  • Run PQA on a dedicated machine.
  • Make the system as quiet as possible, then make it quieter than that. 🙂
  • Set the priority of the APL task to be High or Realtime.
  • The ⎕ai or ⎕ts clocks have insufficient resolution. Use QueryPerformanceCounteror equivalent. The QPC resolution varies but is typically around 0.5e¯6 seconds.
  • Run each benchmark expression enough times to consume a significant amount of time (50e¯3 seconds, say).
  • Repeat the previous step enough times to get a significant sample.

Despite all these, timing differences of less than 5% are not reliably reproducible.

Silver Bullets

Given the difficulties in obtaining accurate benchmarks and given the amount of time and energy required to investigate (phantom) slow-downs, there evolved among us the idea of “silver bullets”: We will actively work on speed-ups instead of waiting for slow-downs to bite.

The speed-ups to be worked on are chosen based on:

  • decades of experience in APL application development
  • decades of experience in APL design and implementation
  • APLMON snapshots provided by users
  • talking to users
  • interactions on the Dyalog Forums
  • running benchmarks on user applications
  • user presentations at Dyalog User Meetings
  • the speed-up is easy, while mindful of Einstein’s admonition against drilling lots of holes where the drilling is easy

We were gratified to receive reports from multiple users that their applications, without changes, sped up by more than 20% from 13.2 to 14.0. A user reported that using the key operator in 14.0 sped up a key part of their application by a factor of 17. Evidently some of the silver bullets found worthwhile targets.

We urge you, gentle reader, to send us APLMON snapshots of your applications, to improve the chance that future speed-ups will actually speed up your applications. APLMON is a profiling facility that indicates which APL primitives are used, how much they are used, the size and datatype of the arguments, and the proportion of the overall time of such usage. An APLMON snapshot preserves the secrecy of the application code.

We will talk about the new silver bullets for version 15.0 in another blog post. Watch this space.

Coverage

We look at benchmarks involving + as a dyadic function to give some idea of what’s being timed. There are 72 such expressions. Variables with the following names are involved:

      'xy'∘.,'bsildz'∘.,'0124'
┌───┬───┬───┬───┐
│xb0│xb1│xb2│xb4│
├───┼───┼───┼───┤
│xs0│xs1│xs2│xs4│
├───┼───┼───┼───┤
│xi0│xi1│xi2│xi4│
├───┼───┼───┼───┤
│xl0│xl1│xl2│xl4│
├───┼───┼───┼───┤
│xd0│xd1│xd2│xd4│
├───┼───┼───┼───┤
│xz0│xz1│xz2│xz4│
└───┴───┴───┴───┘
┌───┬───┬───┬───┐
│yb0│yb1│yb2│yb4│
├───┼───┼───┼───┤
│ys0│ys1│ys2│ys4│
├───┼───┼───┼───┤
│yi0│yi1│yi2│yi4│
├───┼───┼───┼───┤
│yl0│yl1│yl2│yl4│
├───┼───┼───┼───┤
│yd0│yd1│yd2│yd4│
├───┼───┼───┼───┤
│yz0│yz1│yz2│yz4│
└───┴───┴───┴───┘

x and y denote left or right argument; bsildz denote:

b
s
i
l
d
z
boolean
short integer
integer
long integer
double
complex
1 bit
1 byte
2 bytes
4 bytes
8 bytes
16 bytes

0124 denote the base-10 log of the vector lengths. (The lengths are 1, 10, 100, and 10,000.)

The bsil variables are added to the like-named variable with the same digit, thus:

xb0+yb0  xb0+ys0  xb0+yi0  xb0+yl0
xs0+yb0  xs0+ys0  xs0+yi0  xs0+yl0
xi0+yb0  xi0+ys0  xi0+yi0  xi0+yl0
xl0+yb0  xl0+ys0  xl0+yi0  xl0+yl0

xb1+yb1  xb1+ys1  xb1+yi1  xb1+yl1
xs1+yb1  xs1+ys1  xs1+yi1  xs1+yl1
...
xi4+yb4  xi4+ys4  xi4+yi4  xi1+yl4
xl4+yb4  xl4+ys4  xl4+yi4  xl4+yl4

There are the following 8 expressions to complete the list:

xd0+yd0  xd1+yd1  xd2+yd2  xd4+yd4
xz0+yz0  xz1+yz1  xz2+yz2  xz4+yz4

The idea is to measure the performance on small as well as large arguments, and on arguments with different datatypes, because (for all we know) different code can be involved in implementing the different combinations.

After looking at the expressions involving+, the wonder is not why there are as many as 13,600 benchmarks but why there are so few. If the same combinations are applied to other primitives then for inner product alone there should be 38,088 benchmarks. (23×23×72, 23 primitive scalar dyadic functions, 72 argument combinations.) The sharp-eyed reader may have noticed that the coverage for + already should include more combinations:

  xb0+yb1  xb0+yb2  xb0+yb4
  xb0+ys1  xb0+ys2  xb0+ys4
  xb0+yi1  xb0+yi2  xb0+yi4
  xb0+yl1  xb0+yl2  xb0+yl4
  ...
  xl0+yi1  xl0+yi2  xl0+yi4
  xl0+yl1  xl0+yl2  xl0+yl4 

We will be looking to fill in these gaps.

More on the Graphs

You should know that the PQA graphs shown here are atypical. The “blue mountain” is higher and wider than we are used to seeing, and it is unprecedented for a graph (the one for 2016-02-26) to have no red at all.

From version 15.0, the list of new silver bullets is somewhat longer than usual, but silver bullets usually only explain the blue peak. (Usually, speed-ups are implemented only if they offer a factor of 2 or greater improvement, typically achieved only on larger arguments.) What of the part from the “inflection point” around 30 and thence rightward into the red pit?

We believe the differences are due mainly to using a later release of the C compiler, Visual Studio 2015 in 15.0 vs. Visual Studio 2005 in 14.1. The new C compiler better exploits vector and parallel instructions, and that can make a dramatic difference in the performance of some APL primitives without any change to the source code. For example:

      x←?600 800⍴0
      y←?800 700⍴0
      cmpx 'x+.×y' ⍝ 15.0
1.54E¯1
      cmpx 'x+.×y' ⍝ 14.1
5.11E¯1

(Not all codings of inner product would benefit from these compiler techniques. The way Dyalog coded it does, though.)

Together with improvements for specific primitives such as +.× the new C compiler also provides a measurable general speed-up. Unfortunately, along with optimizations the new C compiler also comes with pessimizations. (Remember: speed-ups will not excuse slow-downs.) Some commonly and widely used C facilities and utilities did not perform as well as before. Your faithful Dyalog development team has been addressing these shortcomings and successfully working around them. Having a tool like PQA has been invaluable in the process.

It has been an … interesting month for PQA from 2016-01-22 to 2016-02-27. Now it’s onward to ameliorating the red under Linux and AIX.

Quicksort in APL Revisited

A message in the Forum inquired on sorting strings in APL with a custom comparison function.

First, sorting strings without a custom comparison function obtains with a terse expression:

      {⍵[⍋↑⍵]} 'syzygy' 'boustrophedon' 'kakistocracy' 'chthonic'
┌─────────────┬────────┬────────────┬──────┐
│boustrophedon│chthonic│kakistocracy│syzygy│
└─────────────┴────────┴────────────┴──────┘

Sorting strings with a custom comparison function can also be accomplished succinctly by simple modifications to the Quicksort function

Q←{1≥≢⍵:⍵ ⋄ S←{⍺⌿⍨⍺ ⍺⍺ ⍵} ⋄ ⍵((∇<S)⍪=S⍪(∇>S))⍵⌷⍨?≢⍵}

in a Dyalog blog post in 2014. A version that accepts a comparison function operand is:

QS←{1≥≢⍵:⍵ ⋄ (∇ ⍵⌿⍨0>s),(⍵⌿⍨0=s),∇ ⍵⌿⍨0<s←⍵ ⍺⍺¨ ⍵⌷⍨?≢⍵}

The operand function ⍺ ⍺⍺ ⍵ is required to return ¯1, 0, or 1, according to whether is less than, equal to, or greater than . In QS, the phrase s←⍵ ⍺⍺¨ ⍵⌷⍨?≢⍵ compares each element of against a randomly chosen pivot ⍵⌷⍨?≢⍵; the function then applies recursively to the elements of which are less than the pivot (0>s) and those which are greater than the pivot (0<s).

Example with numbers:

      (×-)QS 3 1 4 1 5 9 2 6 5 3 5 8 97 9
1 1 2 3 3 4 5 5 5 6 8 9 9 97

      3(×-)4
¯1
      3(×-)3
0
      3(×-)¯17
1

Example with strings:

      strcmp←{⍺≡⍵:0 ⋄ ¯1*</⍋↑⍺ ⍵}

      'syzygy' strcmp 'syzygy'
0
      'syzygy' strcmp 'eleemosynary'
1
      'syzygy' strcmp 'zumba'
¯1

      strcmp QS 'syzygy' 'boustrophedon' 'kakistocracy' 'chthonic'
┌─────────────┬────────┬────────────┬──────┐
│boustrophedon│chthonic│kakistocracy│syzygy│
└─────────────┴────────┴────────────┴──────┘

A more efficient string comparison function is left as an exercise for the reader. 🙂

APL Puns

In the Beginning

floor
ceiling
log

and were punnish when the notation was introduced in 1962. They have long since gone mainstream, used even by respectable mathematicians and computer scientists.

To see why is a pun, see the Vector article My Favourite APL Symbol.

•         •        •        •         •

The following are a joint effort by the Dyalog team, many derived after a sumptuous dinner and a few glasses of wine. Column 0 are valid APL expressions. For an enhanced experience may we suggest covering column 1 as you read column 0 for the first time.

In Modern Times

⊃⎕A~'L' the first no el
⍴⍴⍴'your boat' rho rho rho your boat
*⍣¯1⊢fly+fly log of the flies
{⍺←23 ⋄ (2÷(○1)*0.5)×-/⍵×(⍵*2×⍳⍺)÷
(!⍳⍺)×1+2×⍳⍺}ghanistan
erf ghanistan
↑^≡ mix and match
>/+/∘⎕UCS¨'the parts' 'the whole' the sum of the parts is greater than the whole
?⍨2e9 a big deal
○1 APL π
2(⌷⍤1)2 3 4⍴○1 a slice of APL π
0 0⍉3 4⍴○1 another slice of APL π
easy←○1 easy as π
{4×-/÷1+2×⍳⍵}1e6 a π long in the making
pike; see APL Quotations and Anecdotes
spike
pine
spine
Concorde taking (and showing) off
uh-oh
{⍵⌷⍨⊂?⍨≢⍵} degrade
bb∨~bb two b or not two b
bb≥bb two b or not two b
×× sign of the times
a←2÷⍨3 4⍴⍳12 ⋄ *○0j1×a show off
a←2÷⍨3 4⍴⍳12 ⋄ *○0j1×a+2e9 shameless show off
⍎turtles←'⍎turtles' it’s turtles all the way down…

In Honour of a Certain Movie Series

0⍴'menace' I. The Empty Menace
⊢⎕ns¨ 2⍴⎕this II. Tack of the Clones
⌽'htxiS'~'x' III. Reverse of the Sith
⎕io←~⎕io IV. A New Beginning
{'Jedi'⊢1e3×○R} V. The M π R Strikes from the Back
{'Jedi'} VI. Return Jedi
(+⌿÷≢)3 4 5⊣⎕dl 10 VII. The Fork Awakens
may←+⌿÷≢ ⋄ may∘∪ may the fork be with you
0 1 2 3 (⌊/⍬) 5 6 the fourth is strong with this one
4 4 4 4 (⌊/⍬) 4 4 I feel a great disturbance in the fours
1=?2e9 this is not the number you are looking for
1=?2e9 in my experience, there’s no such thing as luck
Luke.## I AM your father
4 ⊃ 0 1 2 3 'dark side' vier ist the path to the dark side
x{0.5×⍵+⍺÷⍵}⍣≡darkside do not underestimate the power of the dark side
⎕ct←1 I find your lack of faith disturbing
0.2 ∊ ÷3 1 4 1 9 2 6 3 8 9 7 9 I find your lack of a-fifth disturbing
{0::⍺⍺~⍵ ⋄ ⍺⍺ ⍵} do, or do not; there is no :Try
R2D2
>⍝ Help me, Obi-Wan Kenobi, you’re my only hope
⊂○⊃ Princess Leia
<○> Yoda
○>--- Death Star
>∘< X-wing fighter
⊢○⊣ tie fighter
○⍨ tie fighter in stealth
⊢○⊣ 1 π fighter
⊣⍲⍲ ATAT
Ewok
Wookie
imperial cruiser
⍋⍒ imperial cruisers colliding
┌──┬──┬──┬──┬──┬──┬──┬──┐
│⊢ │⍒ │⍒⍒│⍋⌽│⌽ │⍋ │⍋⍒│⍒⌽│
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍒ │⍒⍒│⍋⌽│⊢ │⍒⌽│⌽ │⍋ │⍋⍒│
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍒⍒│⍋⌽│⊢ │⍒ │⍋⍒│⍒⌽│⌽ │⍋ │
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍋⌽│⊢ │⍒ │⍒⍒│⍋ │⍋⍒│⍒⌽│⌽ │
├──┼──┼──┼──┼──┼──┼──┼──┤
│⌽ │⍋ │⍋⍒│⍒⌽│⊢ │⍒ │⍒⍒│⍋⌽│
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍋ │⍋⍒│⍒⌽│⌽ │⍋⌽│⊢ │⍒ │⍒⍒│
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍋⍒│⍒⌽│⌽ │⍋ │⍒⍒│⍋⌽│⊢ │⍒ │
├──┼──┼──┼──┼──┼──┼──┼──┤
│⍒⌽│⌽ │⍋ │⍋⍒│⍒ │⍒⍒│⍋⌽│⊢ │
└──┴──┴──┴──┴──┴──┴──┴──┘

 

 

 

climatic space battle

Response to Feedback on Cut, Under and Merge

At Dyalog ’15 John Scholes and I presented proposals for future operators cut, under, and merge. Following the release of the video of this presentation, we received some feedback from a user. Our response to the feedback may be of wider interest.

  1. It’s early days yet for cut, under, and merge (they are tentatively planned for release in version 16.0), and one of the reasons for that is so that we can receive feedback from the user community. So thank you for your comments.
  2. We take our inspiration from many sources, and while it’s true that J has cut, under, and merge, cut saw the light of day in Rationalized APL (1983), under in Operators and Functions (1978), and merge in Rationalized APL (1983) again. And even before that, the ideas have been in mathematics or functional programming for many years.Wherever the inspiration came from, we don’t take them just on faith, just because they are in J.Even now, even in these early days, cut, under, and merge have undergone a vigorous internal debate. The discussion is informed by the experience in other contexts, J or otherwise.
  3. We are interested in what you think are the “few fundamental ideas expressed by a handful of mathematical symbols”, what you say is the definition of APL. From our perspective we have not violated and will not violate what we think are the fundamental ideas. But perhaps our set of fundamental ideas differ from yours?
  4. No one here is thinking of adopting the J spelling scheme (with the dots and colons).
  5. The Dyalog development team are in unanimous agreement that dfns, ⍺⍵-functions, are A Good Thing. As we see it, the recent and proposed additions to Dyalog APL have only enhanced the elegance and utility of dfns. For example, {(+⌿⍵)÷≢⍵}⍤r to compute the mean of not just vectors but subarrays of a specified rank, {⍺⍵}⌸ to compute the unique keys and the corresponding indices, {⍺(+⌿⍵)}⌸ to compute the unique keys and corresponding sums, and {⊂Cut ¯2⊢⍵,','}Cut ¯2 ⊢x~⎕av[4] to partition CSV x into its columns and cells.
  6. Cut, under, and merge are described and documented in discussion papers accompanying the video. If you have not already read them, they can be found in the Dyalog ’15 webpage and specifically here. Please tell us what more can be done as far as documentation, and what parts specifically are unclear.
  7. We have previously (and recently) examined the question of adding quaternions to Dyalog. Currently there are no plans to add them.
  8. Finally, examples of under from everyday life. Of course the processes involved can be composed, so that for example “dinner” can be:
    open the fridge
        take the food
    close the fridge

    but it can also be a more elaborate:

    take plate from cupboard
        open the fridge
            take the food
        close the fridge
     
        put food on plate
            open mouth
                put food in mouth
            close mouth
        clean plate
    put plate back into cupboard

    As remarked in the Dyalog ’15 presentation, once you are attuned to it you can see under everywhere, sometimes in subtle ways: Ashes to ashes, dust to dust.