The Reflex/Commute Operator

The monadic operator ⍨ is defined and modeled as follows:

     f⍨ ⍵  ←→  ⍵ f ⍵
   ⍺ f⍨ ⍵  ←→  ⍵ f ⍺

   {⍺←⍵ ⋄ ⍵ ⍺⍺ ⍺}
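
For example, a quick session illustration of the two cases:

   +⍨ 5         ⍝ reflexive: 5 + 5
10
   2 -⍨ 10      ⍝ commuted: 10 - 2
8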

Reflex

Some well-known functions can be written as f⍨ where f is itself a well-known function:

   +⍨    double
   ×⍨    square
   ?⍨    random permutation
   ⍳⍨    self-index, APL Amuse-Bouche 3

See http://www.jsoftware.com/jwiki/Essays/Reflexive for further examples.

Awareness of the importance of the reflexive case might have led us to avoid the mistake in the definition of the dyadic case of ⍋. That is, if

   ⍺⍋⍵ ←→ ⍺⌷⍨⊂⍋⍵

then ⍋⍨⍵ would be sort. Ken Iverson seemed to have had this awareness because that’s how he defined the dyadic case of ⍋ in J. (I say “seemed” because he expressed surprise when first shown this use of ⍋⍨.)
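
In Dyalog APL, where dyadic ⍋ is not so defined, the alternative definition can be modeled as a dfn (the name Up is mine) to verify the claim:

   Up←{⍺⌷⍨⊂⍋⍵}
   Up⍨ 3 1 4 1 5 9 2 6    ⍝ the reflexive case is sort
1 1 2 3 4 5 6 9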

The monadic case f⍨ came relatively late. It was not in Operators and Functions (1978) nor Rationalized APL (1983), and only introduced in A Dictionary of APL (1987). It came to Ken Iverson when he explicitly looked to natural languages for inspiration, whence it became “obvious”: f⍨⍵ ←→ ⍵ f ⍵ is the reflexive voice (je m’appelle Roger) and ⍺ f⍨ ⍵ ←→ ⍵ f ⍺ is the passive voice (the programming competition was won by a 17-year-old student vs. a 17-year-old student won the programming competition), both having evolved in natural languages for effective communication and elegant expression.

Commute

The alternative definition of ⍺⍋⍵ above, while illustrating the importance of the reflexive case (f⍨ ⍵), also illustrates the passive case (⍺ f⍨ ⍵). ⍺⌷⍨⊂⍋⍵ can be read and understood as “⍺ indexed by the enclosed grade of ⍵”, a different (and in my mind a better) emphasis than (⊂⍋⍵)⌷⍺, “the enclosed grade of ⍵, indexed into ⍺”.

I note that Arianna Locatelli, winner of the 2015 APL Problem Solving Competition but an APL beginner, used ⍨ twenty-one times in her presentation at Dyalog ’15. For example, ⍨ was used in the computation of the standard deviation (slide 13):

   sol5←{a←,⍵ ⋄ ⊃0.5*⍨(⍴a)÷⍨+/2*⍨a-(⍴a)÷⍨+/a}
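
A quick check, using data (my example) whose standard deviation is exactly 2:

   sol5 2 4 4 4 5 5 7 9
2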

I believe this formulation comes naturally because ⍨ can be used to write functions in the order that they are applied. Another way to put it is that ⍨ reduces the need for long-scope parentheses. For example:

   (2×a) ÷⍨ (-b) (+,-) 0.5 *⍨ (b*2)-4×a×c
   ((-b) (+,-) ((b*2)-4×a×c)*0.5) ÷ 2×a
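
For instance, both lines compute the roots of a quadratic from its coefficients; with coefficients 1 ¯3 2 (my example; the roots are 2 and 1):

   a b c←1 ¯3 2
   (2×a) ÷⍨ (-b) (+,-) 0.5 *⍨ (b*2)-4×a×c
2 1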

That ⍨ is easy to implement, {⍺←⍵ ⋄ ⍵ ⍺⍺ ⍺}, is neither here nor there; its value as a tool of thought is easily demonstrated.

Permutations

I started composing a set of APL exercises in response to a query from a new APL enthusiast who attended Morten’s presentation at Functional Conf, Bangalore, October 2014. The first set of exercises is at a low level of difficulty, followed by another set at an intermediate level. One of the intermediate exercises is:

Permutations

a. Compute all the permutations of ⍳⍵ in lexicographic order. For example:

   perm 3
0 1 2
0 2 1
1 0 2
1 2 0
2 0 1
2 1 0

b. Write a function that checks whether ⍵ is a solution to perm ⍺, without computing perm ⍺. You can use the function assert. For example:

assert←{⍺←'assertion failure' ⋄ 0∊⍵:⍺ ⎕SIGNAL 8 ⋄ shy←0}

pcheck←{
  assert 2=⍴⍴⍵:
  assert (⍴⍵)≡(!⍺),⍺:
  …
  1
}

   6 pcheck perm 6
1

   4 pcheck 2 4⍴0 1 2 3, 0 1 3 2
assertion failure
pcheck[2] assert(⍴⍵)≡(!⍺),⍺:
         ∧

c. What is the index of permutation ⍵ in perm ⍺? Do this without computing all the permutations. For example:

   7 ip 1 6 5 2 0 3 4    ⍝ index from permutation
1422

(The left argument in this case is redundant, being the same as ≢⍵.)

d. What is the ⍵-th permutation of ⍳⍺? Do this without computing all the permutations. For example:

   7 pi 1422             ⍝ permutation from index
1 6 5 2 0 3 4

   (perm 7) ≡ 7 pi⍤0 ⍳!7
1

The Anagram Kata

Coincidentally, Gianfranco Alongi was attempting in APL the anagrams kata from Cyber Dojo:

Write a program to generate all potential anagrams of an input string.

For example, the potential anagrams of “biro” are
biro bior brio broi boir bori
ibro ibor irbo irob iobr iorb
rbio rboi ribo riob roib robi
obir obri oibr oirb orbi orib

This is essentially the same program/exercise/kata, because the potential anagrams are 'biro'[perm 4]. You can compare solutions in other languages to what’s here (google “anagrams kata”).

Spoiler Alert

Deriving a Solution

I am now going to present solutions to part a of the exercise, generating all permutations of ⍳⍵.

Commonly, in TDD (test-driven development) you start with a very simple case and try to extend it successively to more general cases. It’s all too easy to be led into a dead-end because the simple case may have characteristics absent in a more general case. For myself, for this problem, I would start “in the middle”: Suppose I have perm 3, obtained by whatever means:

   p
0 1 2
0 2 1
1 0 2
1 2 0
2 0 1
2 1 0

How do I get perm 4 from that? One way is as follows:

   p1←0,1+p
   (0 1 2 3[p1]) (1 0 2 3[p1]) (2 0 1 3[p1]) (3 0 1 2[p1])
┌───────┬───────┬───────┬───────┐
│0 1 2 3│1 0 2 3│2 0 1 3│3 0 1 2│
│0 1 3 2│1 0 3 2│2 0 3 1│3 0 2 1│
│0 2 1 3│1 2 0 3│2 1 0 3│3 1 0 2│
│0 2 3 1│1 2 3 0│2 1 3 0│3 1 2 0│
│0 3 1 2│1 3 0 2│2 3 0 1│3 2 0 1│
│0 3 2 1│1 3 2 0│2 3 1 0│3 2 1 0│
└───────┴───────┴───────┴───────┘

So it’s indexing each row of a matrix m by 0,1+p. There are various ways of forming the matrix m, one way is:

   ⍒⍤1∘.=⍨0 1 2 3
0 1 2 3
1 0 2 3
2 0 1 3
3 0 1 2

(Some authors waxed enthusiastic about this “magical matrix”.) In any case, a solution obtains readily from the above description: Form a matrix from the above individual planes; replace the 0 1 2 3 by ⍳⍵; and make an appropriate computation for the base case (when 0=⍵). See the 2015-07-12 entry below.

The Best perm Function

What is the “best” perm function I can write in APL? This “best” is a benchmark not only on my own understanding but also on advancements in APL over the years.

“Best” is a subjective and personal measure. Brevity comes into it but is not the only criterion. For example, {(∧/(⍳⍵)∊⍤1⊢t)⌿t←⍉(⍵⍴⍵)⊤⍳⍵*⍵} is the shortest known solution, but requires space and time exponential in the size of the result, and that disqualifies it from being “best”. The similarly inefficient {(∧/(⍳⍵)∊⍤1⊢t)⌿t←↑,⍳⍵⍴⍵} is shorter still, but does not work for 1=⍵.

1981, The N Queens Problem

    p←perm n;i;ind;t
   ⍝ all permutations of ⍳n
    p←(×n,n)⍴⎕io
    →(1≥n)⍴0
    t←perm n-1
    p←(0,n)⍴ind←⍳n
    i←n-~⎕io
   l10:p←(i,((i≠ind)/ind)[t]),[⎕io]p
    →(⎕io≤i←i-1)⍴l10

It was the fashion at the time that functions be written to work in either index origin, and therefore to have ⎕io sprinkled hither, thither, and yon.

1987, Some Uses of { and }

   perm:  ⍪⌿k,⍤¯1 (⍙⍵-1){⍤¯1 k~⍤1 0 k←⍳⍵
       :  1≥⍵
       :  (1,⍵)⍴0

Written in Dictionary APL, wherein: ⍪⌿⍵ ←→ ⊃⍪⌿⊂⍤¯1⊢⍵, which differs from its definition in Dyalog APL; ⍙ is equivalent to ∇ in dfns; ⍺{⍵ ←→ (⊂⍺)⌷⍵; and ¯ by itself is infinity.

1990-2007

I worked on perm from time to time in this period, but in J rather than in APL. The results are described in a J essay and in a Vector article. The lessons translate directly into Dyalog APL.

2008, http://dfns.dyalog.com/n_pmat.htm

   pmat2←{{,[⍳2]↑(⊂⊂⎕io,1+⍵)⌷¨⍒¨↓∘.=⍨⍳1+1↓⍴⍵}⍣⍵⍉⍪⍬}

In retrospect, the power operator is not the best device to use, because the left operand function needs both the previous result (equivalent to perm ⍵-1) and ⍵. It is awkward to supply two arguments to that operand function, and the matter is finessed by computing the latter as 1+1↓⍴⍵.

In this formulation, ⍉⍪⍬ is rather circuitous compared to the equivalent 1 0⍴0. But the latter would have required a ⊢ or similar device to separate it from the right operand of the power operator.

2015-07-12

   perm←{0=⍵:1 0⍴0 ⋄ ,[⍳2](⊂0,1+∇ ¯1+⍵)⌷⍤1⍒⍤1∘.=⍨⍳⍵}

For a time I thought the base case could be ⍳1 0 instead of 1 0⍴0, and indeed the function works with that as the base case. Unfortunately (⍳1 0)≢1 0⍴0, the two having different prototypes and datatypes.
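
One way to see the difference (⊃ of an empty array discloses its prototype):

   (⍳1 0) ≡ 1 0⍴0
0
   ⊃⍳1 0      ⍝ prototype of ⍳1 0 is a 2-element vector
0 0
   ⊃1 0⍴0     ⍝ prototype of 1 0⍴0 is a scalar 0
0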

Future

Where might the improvements come from?

  • We are contemplating an under operator whose monadic case is f⍢g ⍵ ←→ g⍣¯1 f g ⍵. Therefore 1+∇ ¯1+⍵ ←→ ∇⍢(¯1∘+)⍵
  • Moreover, it is possible to define ≤⍵ as ⍵-1 (decrement) and ≥⍵ as ⍵+1 (increment), as in J; whence 1+∇ ¯1+⍵ ←→ ∇⍢≤⍵
  • Monadic = can be defined as in J, =⍵ ←→ (∪⍳⍨⍵)∘.=⍳≢⍵ (self-classify); whence ∘.=⍨⍳⍵ ←→ =⍳⍵

Putting it all together:

   perm←{0=⍵:1 0⍴0 ⋄ ,[⍳2](⊂0,∇⍢≤⍵)⌷⍤1⍒⍤1=⍳⍵}

We should do something about the ,[⍳2] :-)

In Praise of Magic Functions: Part II

Part I of this post was concerned with the development speed and execution speed of magic functions and should be read before this post.

Benefitting from Future and Past Improvements

Looking at the magic function for key in Part I, we see that its performance depends on the following APL primitives, listed with information on when they were last improved (or when they will be improved):

expression    factor    version
⍋i            2-18      12.1
⎕DR           1-3.8     14.1
⍳⍨x           2-4       14.1
⍳⍨i           3-6       14.1
↑x            5-12      15.0
⊂[a]x         1-1.75    15.0
b⊂[⎕IO]x      1.3-42    14.1

Key was introduced in version 14.0 and x{⍺,f⌿⍵}⌸y is currently computed by the general case. Comparing the speed in version 14.0 to 15.0 as of June 2015:

   x←?1e6⍴5e4
   y←?1e6⍴0

   1 1 cmpx 'x{⍺,+/⍵}⌸y'    ⍝ 14.0
0.1735

   1 1 cmpx 'x{⍺,+/⍵}⌸y'    ⍝ 15.0
0.12855

   0.1735 ÷ 0.12855
1.34967

a factor of 1.3 improvement without touching the key code. Special code had been planned for {⍺,f/⍵}⌸, and we can get an idea of the amount of speed up:

   cmpx 'x{⍺,+/⍵}⌸y' '(∪x),⍤0⊢x{+/⍵}⌸y'    ⍝ 15.0
 x{⍺,+/⍵}⌸y       → 1.25E¯1 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
 (∪x),⍤0⊢x{+/⍵}⌸y → 1.16E¯2 | -91% ⎕⎕⎕

Perhaps the special code for {⍺,f/⍵}⌸ can be a magic function?

Magic functions also benefit from past improvements. The primitive ⍕b (in the Suggestivity subsection below) has a magic function implementation: MAGIC("(¯1+2×≢⍵)⍴(2 2⍴'0 1 ')[⍵;]⊣⎕io←0"). Then I thought, why not get an extra level of performance by doing it in native C? Benchmarks done after that work showed that the native C implementation was faster by a factor of 25 but the magic function was faster by a factor of 40. What happened? On reflection, the speed of the code depends crucially on the speed of x[b;…;] and it turned out that x[b;…;] was previously sped up (version 13.2). Apparently the x[b;…;] code is faster than a casually-written C program.

Finally, I expect magic functions are prime candidates for being fed to the APL compiler when it becomes possible to do so.

APL as a Tool of Implementation

Being APL code, magic functions have the characteristic terseness of APL and all that that entails. K.E. Iverson’s 1979 Turing Lecture Notation as a Tool of Thought lists five important characteristics of notation. We examine how they play out in this context.

Ease of Expression
Notation as a Tool of Thought entitles this characteristic as “ease of expressing constructs arising in problems”, explicated as follows (§1.1):

If it is to be effective as a tool of thought, a notation must allow convenient expression not only of notions arising directly from a problem, but also of those arising in subsequent analysis, generalization, and specialization.

For magic functions, the analogous concept would be facility in program development, debugging, tracing, profiling, benchmarking, and other activities required in implementation. Dyalog APL has these in spades.

Suggestivity
It was first identified that ⍕b can be sped up by doing (¯1+2×≢⍵)⍴(2 2⍴'0 1 ')[⍵;]⊣⎕io←0, the idea being that if f is an expensive function and x is from a small domain D, then f x can be computed by (f D)[x;]. (A similar idea is described in §4.3 of Notation as a Tool of Thought.) For the idea to be effective, f does not have to be very expensive, just more expensive than indexing, and x sufficiently large. A few speed-ups in the pattern have been identified (a minimal illustration follows below):

⍕b
x⍕b
float ⍳ b
float ⍳ int8
scalar×b and b×scalar
scalar*b

All of these can be implemented by magic functions. As well, the utility of the pattern (f D)[i;] motivates extra attention on x[i;…;]. Fortuitously, x[b;…;] has previously been made fast.
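
A minimal illustration of the pattern, using the scalar×b case (a sketch; ⎕IO←0 and the data are mine):

   ⎕io←0 ⋄ b←1=?10⍴2 ⋄ s←3.14
   (s×b) ≡ (0,s)[b]    ⍝ × applied to the domain 0 1, then indexed by b
1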

Subordination of Detail
Writing in APL means not having to deal with many details involved in writing in C in the interpreter, including:

  • workspace compaction
  • arrays changing their internal datatype
  • call by name
  • allocating and releasing memory
  • index errors AKA memory read and write exceptions
  • different numeric types
  • different character types
  • loops and nested loops
  • declarations
  • etc.

Economy
Economy refers to the size of the vocabulary and the complexity of the grammar. Here, the comparison is not just between APL and C (in which comparison APL wins), but between APL and the collection of programs, utilities, macros, data structures, typedefs, calling conventions, …, accumulated by the C source code over the years. These have not been as rigorously defined and executed as APL.

Amenability to Formal Manipulation
The opening paragraph of Part I had a different magic function for ∧.= in a previous version. In repeated readings of the MSS and in meditation on the ideas I realized that the line of code can be made simpler:

MAGIC("((≢⍺)↑i)∘.=(≢⍺)↓i←⍺⍳⍺⍪⍉⍵");    ⍝ old version
MAGIC("(≢⍺)(↑∘.=↓)⍺⍳⍺⍪⍉⍵");           ⍝ current version

I daresay that such simplification would be harder to conceive with a more verbose statement of the algorithm.

In Praise of Magic Functions: Part I

A magic function is an APL-coded (dfn) computation in the C source of the interpreter. For example, some cases of ∧.= are coded as MAGIC("(≢⍺)(↑∘.=↓)⍺⍳⍺⍪⍉⍵"). The rubric “magic function” includes magic functions and magic operators.
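
To see concretely what this magic function computes, a small check (example data mine):

   a←↑'abc' 'dab' 'cab'
   w←⍉↑'abc' 'bad'
   (a∧.=w) ≡ (≢a)(↑∘.=↓)a⍳a⍪⍉w
1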

Acknowledgments. Magic functions in Dyalog are made possible due to work done by John Scholes and Richard Smith.

Development Speed

A magic function implementation takes an order of magnitude less time to write than a C-coded implementation. A case in point is ∘.≡ (and ∘.≢) on enclosed character vectors, as recounted in the blog post A Speed-Up Story. The chronology is recovered from the G-mail, Mantis and SVN systems.

2014-10-27 04:24 Nicolas initial e-mail describing the problem
2014-10-27 07:33 Roger proposes i∘.=i←a⍳a as a solution
2014-10-27 07:36 Jay objects that solution should check ⎕CT
2014-10-27 07:40 Roger responds to objection
2014-10-27 10:29 Roger creates Mantis issue
2014-10-27 13:57 Roger SVN commit
2014-10-27 18:12 Roger reports factor 6-64 speed-up and submits blog post
2014-10-28 16:33 Roger SVN commit to fix a bug

After checking that ⎕CT is not required, the main processing in the C source is as follows:

  • c   1 iff a “selfie”, i.e. the left and right arguments are equal as C pointers
  • eq  1 if ∘.≡ ; 0 if ∘.≢
  • r   1 iff the right argument has rank 1
#define CASE(x,y,z)  ((x<<2)+(y<<1)+(z))

switch(CASE(c,eq,r))
{
  case CASE(0,0,0): MAGIC("((,⍺)⍳⍺)∘.≠((,⍺)⍳⍵)"); break;
  case CASE(0,0,1): MAGIC("(  ⍺ ⍳⍺)∘.≠(  ⍺ ⍳⍵)"); break;
  case CASE(0,1,0): MAGIC("((,⍺)⍳⍺)∘.=((,⍺)⍳⍵)"); break;
  case CASE(0,1,1): MAGIC("(  ⍺ ⍳⍺)∘.=(  ⍺ ⍳⍵)"); break;
  case CASE(1,0,0): MAGIC("∘.≠⍨(,⍵)⍳⍵");          break;
  case CASE(1,0,1): MAGIC("∘.≠⍨  ⍵ ⍳⍵");          break;
  case CASE(1,1,0): MAGIC("∘.=⍨(,⍵)⍳⍵");          break;
  case CASE(1,1,1): MAGIC("∘.=⍨  ⍵ ⍳⍵");          break;
}

Execution Speed

Development speed for a magic function need not be at the expense of execution speed. As indicated above, ∘.≡ is 6 to 64 times faster than the previous (C coded) implementation. Factors for a few other magic function implementations:

expression        factor    version
x∘.≡y and x∘.≢y   6-64      14.1
x∧.=y and x∨.≠y   1.5-4     14.1
,/y               1-∞       15.0
x⊥y               1-26      15.0
x⊤y               1-3.3     15.0

One can always make a magic function faster by hand-translating it from APL into C, and in so doing save on the tokenizing (scanning) and parsing costs. However, the effort increases the development time and the code size (see Special Cases below) for (I argue) not much reduction in execution time. I also expect that such translation can be done automatically by the APL compiler in the future.

Accurate estimates for the amount of speed up obtain readily from APL benchmarks. For example, x⍕b where x is a scalar or 2-element vector and b is a Boolean array, is a candidate for a magic function implementation, and:

   f←{((¯1↓s),(⊃⌽⍴t)×⊃⌽s←⍴⍵)⍴(t←⍺⍕⍪0 1)[⍵;]}

   b←1=?10⍴2     ⍝ small argument
   cmpx '2 f b' '2⍕b'
 2 f b → 5.83E¯6 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
 2⍕b   → 1.23E¯5 | +110% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   b←1=?1e6⍴2    ⍝ large argument
   cmpx '8 2 f b' '8 2⍕b'
 8 2 f b → 9.75E¯3 |     0% ⎕
 8 2⍕b   → 3.35E¯1 | +3339% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

We can say with confidence that x⍕b can be made 2 to 34 times faster before actually proceeding with the implementation.

Magic functions can be used when execution speed is not paramount. For example, a problem was reported with !∘y⍣¯1:

   !∘25⍣¯1 ⊢ 1e4
DOMAIN ERROR
   !∘25⍣¯1⊢10000
  ∧

The problem can be solved readily with an APL function using Newton iteration, with a relatively complicated computation for the initial estimate:

   f←{-∘⍵∘⍺⍺{⍵-(⊃t)ׯ0.01÷-/t←⍺⍺ ⍵×1 1.01}⍣≡⊃⍋|⍵-(⍳!⊢)⌊0.5+⍺⍺ 1}
   ⎕←x←!∘25 f 1e4
3.85142
   x!25
10000

General Case vs. Special Cases

Magic functions are well-suited for implementing operators. An operator has potentially tens or hundreds of cases, and it would be onerous, not cost-effective and unnecessary to write code for each case. Instead, the general case can be implemented quickly and succinctly by a magic function, and the saved effort can be devoted to cases deemed worthy of attention. The rank and key operators are done this way. The magic function for the general case of key:

MAGIC(
  "0=⎕nc'⍺':⍵ ∇ ⍳≢⍵    ⋄"  // monadic case: f⌸ x ←→ x f⌸ ⍳≢x
  "⍺ ⍺⍺{⍺ ⍺⍺ ⍵}{        "
  "  ⎕ml←1             ⋄"  // needed by ↑⍵ and ⍺⊂⍵ 
  "  j←⍋i⊣⎕dr i←⍳⍨⍺    ⋄"
  "  ↑ (⊂[1↓⍳⍴⍴⍺]((⍳≢i)=⍳⍨i)⌿⍺) ⍺⍺¨ (2≠/¯1,i[j])⊂[⎕io]⍵⌷⍨⊂j"
  "}⍵                   "
);

Before execution reaches the general case, special cases are detected and are implemented by special code, usually in C. These special cases are:

operand              version  comment
{f⌿⍵} and ⊢∘(f⌿)     14.0     for f one of + ⌈ ⌊ or (for Boolean right arguments) one of ∧ ∨ = ≠; also f/ instead of f⌿ for vector right arguments
{⍺(f⌿⍵)}             15.0     for f as above
{⍺,f⌿⍵}              15.0     for f as above and numeric left arguments
{≢⍵} and ⊢∘≢         14.0
{⍺(≢⍵)}              14.1
{⍺,≢⍵}               14.1     for numeric left arguments
{≢∪⍵}                14.1
{⍺(≢∪⍵)}             14.1
{⍺,≢∪⍵}              14.1     for numeric left arguments
{⊂⍵} and ⊢∘⊂         14.0
{⍺⍵}                 14.1
{⍺} and ⊣            14.1     ⊣⌸x ←→ ∪x if ∪ were extended like ⍳

Additional special cases can be implemented as required.

Special Cases

Since magic functions are so terse, sometimes it is worthwhile to make special cases which would not be made special cases if the implementation were more verbose and/or required more effort. The implementation of ∘.≡ (in the Development Speed section above) illustrates the point. In general, ⍺∘.≡⍵ ←→ ((,⍺)⍳⍺)∘.=((,⍺)⍳⍵). However, if ⍺ is a vector, the two ravels can be elided; moreover, if ⍺ and ⍵ are equal as C pointers, the two uses of ⍳ can be reduced to one use (not only that, but to a “selfie” ⍵⍳⍵ if ⍵ is a vector). So instead of the one case:

MAGIC("((,⍺)⍳⍺)∘.=((,⍺)⍳⍵)");

we have

switch(CASE(c,eq,r))
{
    …
  case CASE(0,1,0): MAGIC("((,⍺)⍳⍺)∘.=((,⍺)⍳⍵)"); break;
  case CASE(0,1,1): MAGIC("(  ⍺ ⍳⍺)∘.=(  ⍺ ⍳⍵)"); break;
    …
  case CASE(1,1,0): MAGIC("∘.=⍨(,⍵)⍳⍵");          break;
  case CASE(1,1,1): MAGIC("∘.=⍨  ⍵ ⍳⍵");          break;
}

The extra cases are not too burdensome, and their detection is done in scalar code at which C is better than APL. The following benchmarks illustrate the difference that the special cases make:

   t ←' zero one two three four five six seven eight nine'
   t,←' zéro un deux trois quatre cinq six sept huit neuf'
   t,←' zero eins zwei drei vier fünf sechs sieben acht neun'
   u←1↓¨(' '=t)⊂t

   x←u[?300⍴≢u]    ⍝ vector selfie vs. non-selfie
   y←(⍴x)⍴x
   cmpx 'x∘.≡x' 'x∘.≡y'
 x∘.≡x → 1.48E¯4 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
 x∘.≡y → 1.93E¯4 | +30% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

   x1←(1,⍴x)⍴x     ⍝ matrix selfie vs. non-selfie
   y1←(⍴x1)⍴x1
   cmpx 'x1∘.≡x1' 'x1∘.≡y1'
 x1∘.≡x1 → 1.50E¯4 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
 x1∘.≡y1 → 1.97E¯4 | +31% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

To be continued…Part II will describe the benefits from future improvements and APL as a tool of implementation.

A Dialog on APL

A discussion between Nicolas Delcros and Roger Hui

Nicolas, Prologue: From a language point of view, thanks to Ken Iverson, it is obvious that you want grade rather than sort as a primitive. Yet from a performance point of view, sort is currently faster than grade.

Can one be “more fundamental” than the other? If so, who’s wrong…APL or our CPUs? In any case, what does “fundamental” mean?

Roger: Sorting is faster than grading due to reasons presented in the J Wiki essay Sorting versus Grading, in the paragraph which begins “Is sorting necessarily faster than grading?” I cannot prove it in the mathematical sense, but I believe that to be the case on any CPU when the items to be sorted/graded are machine units.

Nicolas, Formulation 1: Now, parsing ⍵[⍋⍵] (and not scanning the idiom) begs the question of how deeply an APL interpreter can “understand” what it’s doing to arrays.

How would an APL compiler resolve this conjunction in the parse tree? Do you simply have a bunch of state pointers such as “is the grade of” or “is sorted” or “is squozen” or “axis ordering” etc. walking along the tree? If so, do we have an idea of the number of such state pointers required to exhaustively describe what the APL language can do to arrays? If not, is there something more clever out there?

Roger: I don’t know of any general principles that can tell you what things can be faster. I do have two lists, one for J and another for Dyalog. A big part of the lists consists of compositions of functions, composition in the mathematical sense, that can be faster than doing the functions one after the other if you “recognize” what the composed function is doing and write a native implementation of it. Sort vs. grade is one example (sort is indexing composed with grade). Another one is (⍳∘1 >) or (1 ⍳⍨ >). The function is “find the first 1” composed with >. These compositions have native implementations and:

      x←?1e6⍴1e6

      cmpx '5e5(1 ⍳⍨ >)x' '(5e5>x)⍳1' 
5e5(1 ⍳⍨ >)x → 0.00E0  |       0%
(5e5>x)⍳1    → 1.06E¯2 | +272100% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

      cmpx '¯1(1 ⍳⍨ >)x' '(¯1>x)⍳1' 
¯1(1 ⍳⍨ >)x → 2.41E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(¯1>x)⍳1    → 4.15E¯3 | +71% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

If you get a “hit” near the beginning, as would be the case with 5e5, you win big. Even if you have to go to the end (as with ¯1), you still save the cost of explicitly generating the Boolean vector and then scanning it to the end.

Another one, introduced in 14.1, is:

      cmpx '(≢∪)x' '≢∪x'
(≢∪)x → 4.43E¯3 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
≢∪x   → 1.14E¯2 | +157% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

This is the tally nub composition, used often in a customer application. If you “know” that you just want the tally of the nub (uniques), you don’t actually have to materialise the array for the nub.

I am not conversant with compiler technology so I don’t know what all an APL compiler can do. I do know that there’s a thing called “loop fusion” where, for example, in a+b×c÷2, it doesn’t have to go through a, b, c in separate loops, but can instead do

    :for i :in ⍳≢a ⋄ z[i]←a[i]+b[i]×c[i]÷2 ⋄ :endfor

saving on the array temp on every step. You win some with this, but I think the function composition approach wins bigger. On the other hand, I don’t know that there is a general technique for function composition. I mean, what general statements can you make about what things can be faster (AKA algebraically simpler)?

Nicolas: I sort of see…so jot is a “direct” conjunction. An indirect conjunction could be ⍵[⍺⍴⍋⍵] where the intermediate grade is reshaped. We “know” that grade and shape are “orthogonal” and can rewrite the formula to ⍺⍴⍵[⍋⍵].

So if we can establish a list of flags, and establish how primitives touch these flags, and how these flags affect each other, then we can extend to any path of the parse tree, provided that the intermediate nodes don’t destroy the relationship between the two operations (here grade + indexing provides sort, independently of the reshape).

Of course we can spend our lives finding such tricks. Or we could try and systemise it.

Roger: What you are saying above is that reshape and indexing commute (i.e. reshape indexing ←→ indexing reshape). More generally, compositions of the so-called structural functions and perhaps of the selection functions are amenable to optimisations. This would especially be the case if arrays use the “strided representation” described in Nick Nickolov’s paper Compiling APL to JavaScript in Vector. I used strided representation implicitly to code a terse model of ⍺⍉⍵ in 1987.

Nicolas, Formulation 2: On which grounds did the guys in the ’50s manage to estimate the minimal list of operations that you needed to express data processing?

Roger: APL developed after Ken Iverson struggled with using conventional mathematical notation to teach various topics in data processing. You can get an idea of the process from the following papers:

In our own humble way, we go through a similar process: We talk to customers to find out what problems they are faced with, what things are still awkward, and think about what if anything we can do to the language or the implementation to make things better. Sometimes we come up with a winner, for example ⌸. You know, the idea for ⍋ (grade) is that often you don’t just use ⍋x to order x (sort) but you use ⍋x to order something else. Similarly with ⍳, you often don’t want just x⍳y but you use it to apply a function to items with like indices. The J Wiki essay Key describes how the idea arose in applications, and then connected with something I read about, the Connection Machine, a machine with 64K processors (this was in the 1980s).
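
A small illustration (my example) of using ⍋x to order something else:

   ages←39 7 36 9
   initials←'BAED'     ⍝ Bob Amy Eve Dan
   initials[⍋ages]     ⍝ the grade of ages orders the initials
ADEB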

Nicolas, Formulation 3: Do we have something with a wider spectrum than “Turing complete or not” to categorise the “usefulness and/or efficiency” of a language?

Roger: Still no general principles, but I can say this:

  • Study the languages of designers whom you respect, and “borrow” their primitives, or at least pay attention to the idea. For example, =x is a grouping primitive in k. Symbols are Arthur Whitney’s most precious resource and for him to use up a symbol is significant.

    =x ←→ {⊂⍵}⌸x   old k definition

    =x ←→ {⍺⍵}⌸x   current k definition

    Both {⊂⍵}⌸x (14.0) and {⍺⍵}⌸x (14.1) are supported by special code.

  • Study the design of machines. For a machine to make something “primitive” is significant. For example, the Connection Machine has an instruction “generalised beta” which is basically {f⌿⍵}⌸. Around the time the key operator was introduced, we (Ken Iverson and I) realised that it’s better not to build reduction into a partitioning, so that if you actually want to have a sum you have to say +⌿⌸ rather than just +⌸. The efficiency can be recovered by recognising {f⌿⍵}⌸ and doing a native implementation for it. (Which Dyalog APL does, in 14.0.)
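
For instance (my example), summing values by key with the reduction kept separate from the partitioning:

   'abca' {+⌿⍵}⌸ 1 2 3 4    ⍝ recognised and computed by special code
5 2 3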

Roger, Epilogue: To repeat Nicolas’ question in the opening section, what does “fundamental” mean? (Is grade more fundamental than sort?)

In computability, we can use Turing completeness as the metric. But here the requirement is not computability but expressiveness. I don’t know that there is a definitive answer to the question. One way to judge the goodness of a notation, or of an addition to the existing notation such as dfns or key or rank, is to use it on problems from diverse domains and see how well it satisfies the important characteristics of notation set out in Ken Iverson’s Turing lecture:

  • ease of expressing constructs arising in problems
  • suggestivity
  • subordination of detail
  • economy
  • amenability to formal proofs

I would add to these an additional characteristic: beauty.

Exploring Key

In ⍺ f⌸ ⍵, major cells of ⍺ specify keys for the corresponding major cells of ⍵, and f applies to each unique major cell of ⍺ and the major cells of ⍵ having that key. The monadic case f⌸⍵ is equivalent to ⍵ f⌸ ⍳≢⍵. Key is similar to the GROUP BY clause in SQL.

For example:

   ⎕io←0

   (⍳11) ,[¯0.5] 'Mississippi'
0 1 2 3 4 5 6 7 8 9 10
M i s s i s s i p p  i

   {⍺}⌸'Mississippi'                 {⍺⍵}⌸'Mississippi'
Misp                              ┌─┬────────┐
                                  │M│0       │
   {⊂⍵}⌸'Mississippi'             ├─┼────────┤
┌─┬────────┬───────┬───┐          │i│1 4 7 10│
│0│1 4 7 10│2 3 5 6│8 9│          ├─┼────────┤
└─┴────────┴───────┴───┘          │s│2 3 5 6 │
                                  ├─┼────────┤
   {≢⍵}⌸'Mississippi'             │p│8 9     │
1 4 4 2                           └─┴────────┘
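
The monadic-dyadic equivalence stated above can be checked directly:

   ({≢⍵}⌸'Mississippi') ≡ 'Mississippi' {≢⍵}⌸ ⍳11
1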

We illustrate key by using it to model some APL primitives and to motivate and solve a programming puzzle.

Index-Of and Friends

Since key is defined in terms of unique and indices, we expect to be able to effect members of the index-of family. And so we can.

The specifications say that f applies to each unique major cell of ⍺. Therefore, {⍺}⌸ and equivalently ⊣⌸ compute unique. Since the specifications are for unique major cells, these are what ∪⍵ would be if it were extended to non-vector arguments, analogous to the way ⍳ was extended to non-vector left arguments.

   x←↑'Aspen' 'John' 'Susan' 'Roger' 'Opal' 'John' 'Aspen'
   x
Aspen
John 
Susan
Roger
Opal 
John 
Aspen

   {⍺}⌸x                  ⊣⌸x
Aspen                  Aspen
John                   John 
Susan                  Susan
Roger                  Roger
Opal                   Opal 

A little surprisingly, key can also be used to model extended index-of and extended membership. For reasons which will become clear, we model has (AKA ∊⍨) instead of ∊. Both models use the merge operator, whimsically denoted here by →.

ix  ← {(≢⍺) ↓ (0⍴⍨(≢⍺)+≢⍵) ((∊i)→)⍨ (≢¨i)/(≢⍺)⌊⊃¨i←{⊂⍵}⌸⍺⍪⍵}
has ← {(≢⍺) ↓ (0⍴⍨(≢⍺)+≢⍵) ((∊i)→)⍨ (≢¨i)/(≢⍺)>⊃¨i←{⊂⍵}⌸⍺⍪⍵}
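
The merge operator → is notational rather than a Dyalog primitive. To run the models, one possible substitute (my definition) is a dfn operator taking the indices as its operand; replace ((∊i)→)⍨ with ((∊i)merge)⍨ in ix and has:

merge←{r←⍵ ⋄ r[⍺⍺]←⍺ ⋄ r}    ⍝ ⍵ with ⍺ placed at indices ⍺⍺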

   y←↑'China' 'Susan' 'John' 'John' 'Anne' 'Roger'
   y
China
Susan
John 
John 
Anne 
Roger

   x ix y
7 2 1 1 7 3

   x has y
0 1 1 1 0 1

To focus on essentials, ix and has do not directly handle the case where the right argument has higher rank than the left argument (the principal argument). Such higher rank arguments can be handled by pre-application of {↑,⊂⍤(¯1+⍴⍴⍺)⊢⍵} and post-application of ((1-⍴⍴⍺)↓⍴⍵)⍴.

Partition

Since key can effect non-contiguous partitions, we expect to be able to use it to effect contiguous partitions, a special case of the former. And so we can:

   part ← {(+\b/⍺){⊂⍵}⌸b⌿⍵⊣b←∨\⍺}

   x
Aspen
John 
Susan
Roger
Opal 
John 
Aspen
   0 1 0 1 0 0 1 part x
┌─────┬─────┬─────┐
│John │Roger│Aspen│
│Susan│Opal │     │
│     │John │     │
└─────┴─────┴─────┘
   0 1 0 1 0 0 1 (⊂[0] ≡ part) x
1
   (7⍴0) (⊂[0] ≡ part) x
1

Sort and Grade

sort  ← {⍺⌿⍨¯1+{≢⍵}⌸⍺⍪⍵}
grade ← {(≢⍺)-⍨∊1↓¨{⊂⍵}⌸⍺⍪⍵}

If ⍵ is an array from a small “alphabet” ⍺, then ⍺ sort ⍵ sorts ⍵ and ⍺ grade ⍵ grades it. For example:

   x←?1e6⍴256                        a←⎕a[?1e6⍴26]

   (⍋x)  ≡ (⍳256) grade x            (⍋a)  ≡ ⎕a grade a
1                                 1
   x[⍋x] ≡ (⍳256) sort  x            a[⍋a] ≡ ⎕a sort  a
1                                 1

The Maximum of a Vector

In Dyalog ’13 session D11, I included a slide giving relative timings on the then new key operator:

x←?1e6⍴1000

a. ¯1+{≢⍵}⌸(⍳1000),x     1.00
b. ¯1+{≢⍵}⌸(⍳1+⌈/x),x    1.62
c. ...

The timings show that key with tally takes less than a factor of 2 over finding the maximum. This is fast; unbelievably fast even. The following puzzle arose from investigations into this timing: Find the maximum of a vector of unsigned 1-byte integers without using multicore, vector instructions, loop unrolling, etc. Can you do it faster in C than the following?

max=*x++; 
for(i=1;i<n;++i){if(max<*x)max=*x; ++x;}

I posed the puzzle to some expert C programmers, and they failed to solve it. An “obvious” solution obtains by considering an obvious implementation of {≢⍵}⌸x in C:

M=256;
memset(c,0,M*4);              // initialize table of counts
for(i=0;i<n;++i)c[*x++]++;    // analogous to c[x]+←1 in APL

Therefore, one way to find the maximum of x is to have an M-element (byte) Boolean table b, and:

memset(b,0,M);
for(i=0;i<n;++i)b[*x++]=1;    // analogous to b[x]←1 in APL
c=b+M;                        // scan from the back
for(i=0;i<M;++i)if(*--c){max=c-b; break;}

Timings show that the table method is faster than the comparison method by a factor of 1.5 for 1-byte integers and 1.3 for 2-byte integers. The table method requires more code, but the main n-loop, the time for which dominates the overall time, is simpler for the table method.

Jay Foad countered that finding the maximum (and the minimum) is much faster still with the vector instructions available in modern CPUs. Dyalog has now (version 14.1) implemented such finding, with performance benefits for key, the index-of family, grade, and sort.