A Speed-Up Story

The first e-mail of the work week came from Nicolas Delcros. He wondered whether anything clever can be done with ∘.≡ on enclosed character strings. I immediately thought of using “magic functions“, an implementation technique whereby interpreter facilities are coded as dfns. I thought of magic functions because the APL expressions involved in this case are particularly terse:

   t ←' zero one two three four five six seven eight nine'
   t,←' zéro un deux trois quatre cinq six sept huit neuf'
   t,←' zero eins zwei drei vier fünf sechs sieben acht neun'
   b←⌽a←1↓¨(' '=t)⊂t

   cmpx 'a∘.≡a' '∘.=⍨⍳⍨a'
  a∘.≡a   → 9.69E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ∘.=⍨⍳⍨a → 8.21E¯6 | -92% ⎕⎕
   cmpx 'a∘.≡b' '(a⍳a)∘.=(a⍳b)'
  a∘.≡b         → 9.70E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (a⍳a)∘.=(a⍳b) → 1.43E¯5 | -86% ⎕⎕⎕⎕

   y←⌽x←300⍴a

   cmpx 'x∘.≡x' '∘.=⍨⍳⍨x'
  x∘.≡x   → 9.53E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ∘.=⍨⍳⍨x → 1.52E¯4 | -99% ⎕
   cmpx 'x∘.≡y' '(x⍳x)∘.=(x⍳y)'
  x∘.≡y         → 9.55E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (x⍳x)∘.=(x⍳y) → 1.95E¯4 | -98% ⎕

The advantage will be even greater if/when we speed up . (And, obviously, the idea is also applicable to ∘.≢ : just replace with and = with .)

Jay Foad objected that the comparisons above aren’t quite fair as the more verbose expressions should check that either ⎕ct←0 or that the arguments do not contain any elements subject to tolerant comparison. I countered that the checking in C would not greatly affect the benchmarks as the time to do the checking is O(m+n) but the benefits are O(m×n) .

Since the factors are so promising and the coding relatively easy, I went ahead and did the work, with the following results:

14.1 14.0 ratio cost of checking
a∘.≡a 9.16e¯6 9.69e¯5 10.58 1.09
a∘.≡b 1.53e¯5 9.70e¯5 6.34 1.08
x∘.≡x 1.48e¯4 9.53e¯3 64.39 1.00
x∘.≡y 1.94e¯4 9.55e¯3 49.23 1.01

The last column (the cost of checking that tolerant comparison is not involved) is under 10% and decreases as the argument sizes increase.

This work is another illustration of the ubiquity and practical usefulness of the “selfie” concept — in the new (14.1) implementation, x∘.≡y is faster when the left and right arguments are the same than when they are not. In particular, the selfie x⍳x or ⍳⍨x occurs twice, and bolsters the point made in item 3 of Sixteen APL Amuse-Bouches:

x⍳x are like ID numbers; questions of identity on x can often be answered more efficiently on x⍳x than on x itself.

Finally, after all that, I recommend that one should consider using x⍳y before using x∘.≡y . The latter requires O(m×n) space and O(m×n) time, and is inherently inefficient.

Selective Expression Tracing

Operator trace from dfns.dws is a simple and low-tech tool for displaying intermediate results of functions, during the evaluation of an expression. For example, what’s going on here?

      (+⌿ ÷ ≢) 1 2 3 4
2.5

We can guess that the above fork is computing the average of the right argument. To see how this works:

      )copy dfns trace
...

⍝ Rename to reduce clutter

      t ← trace

⍝ Bind each "tine" of the fork with t to see what's happening

      (+⌿t ÷t ≢t) 1 2 3 4
≢  1 2 3 4  =>  4
+⌿  1 2 3 4  =>  10
10  ÷  4  =>  2.5
2.5

⍝ But how is that +⌿ reduction evaluated?

      (+t⌿ ÷ ≢) 1 2 3 4
3  +  4  =>  7
2  +  7  =>  9
1  +  9  =>  10
2.5

As a second example, what does the atop (~⍷) do in the following function to remove superfluous '.' characters from its argument?

      {('··'(~⍷)⍵)/⍵} 'squeeze·····me'
squeeze·me

      {('··'(~⍷)t⍵)/⍵} 'squeeze·····me'
··  ~⍷  squeeze·····me  =>  1 1 1 1 1 1 1 0 0 0 0 1 1 1
squeeze·me

…and so forth. Notice how t can be injected to the right of any of the functions in an expression.

For more examples, see the notes for trace.

Data-driven Conditionals

ACKNOWLEDGEMENT: My thanks to Andy Shiers for simplifying the examples

Visitors to APL from other languages are sometimes a bit sniffy about our use of numbers (1 0) for Boolean values True and False. They’re further bemused by our fondness for moving conditional processing out of code and into data. For example, instead of:

   0=2|⍵: ⍵÷2 ⋄ ⍵       ⍝ even number: half

we might say:

   ⍵ ÷ 1+0=2|⍵          ⍝ half if even

At first blush, this seems a bit loony; such transformations appear to produce code which is both slower and more obscure! Here’s another example:

   2|⍵: 1+3×⍵ ⋄ ⍵       ⍝ odd: one more than its triple

vs:

   p ← 2|⍵              ⍝ parity
   p + ⍵ × 1+2×p        ⍝ one more than triple, if odd

But think arrays! Data-driven processing of conditionals is inherently parallel and so performs significantly better for large arrays.

The Collatz conjecture asserts that the sequence “half if even, else one plus triple if odd” always converges to 1. For example, for the number 7 this sequence is:

   7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1

To test this up to a million we can borrow time, a coarse-grained timing operator, from dfns.dws:

      scalar←{                                  
          {                 ⍝ scalar function
              2|⍵:1+3×⍵     ⍝ odd              
              ⍵÷2           ⍝ even             
          }⍣{⍺=1}¨⍵         ⍝ applied under each
      }


      vector←{         
          {                 ⍝ vector function                                              
              p←2|⍵         ⍝ parity                   
              u←⍵÷2-p       ⍝ half of evens            
              v←p+u×1+2×p   ⍝ 1+3× odds                
              ∪v~1          ⍝ without 1s and duplicates
          }⍣{⍺≡⍬}⍵          ⍝ applied to whole vector                           
      }  


      scalar time ⍳1e6      ⍝ scalar processing 8¾ minutes
08:46.03
 
      vector time ⍳1e6      ⍝ array processing 8¾ seconds
08.74

In the above, time and are high order “operators”.

Defined operator time takes a function left operand, which it applies to its argument, returning the execution time in minutes and seconds.

Primitive operator (power or fixpoint) applies its left operand function repeatedly until its right operand function, which is supplied with the result () and argument of each application, returns True (1).

For further discussion of the Collatz function, see http://dfns.dyalog.com/n_osc.htm.

Why Dyalog ’14 was a Shear Delight

rot55

Recent versions of Microsoft Windows support touch screens, which of course means that applications can respond to events originating from touches. Microsoft calls these events “gestures”.

Dyalog decided to add support for gestures to version 14.1 and so projects were planned, designs were designed, code was coded and, at Dyalog ’14 (#Dyalog14), a demonstration was demonstrated. The video of that demonstration is here

The three most interesting gestures are “pan”, “zoom” and “rotate”; I needed a good demo for the rotate gesture. I had an idea but was unsure how to implement it, so I decided to ask the team. Here’s the email I sent (any typos were included in the original email!):

“Does any have code (or an algorithm, but code ‘d be better) to rotate a matrix by an arbitrary angle. The resulting matrix may then be larger than the original but will be filled with an appropriate value.
Does the question makes sense? For the conference I want to demonstrate 14.1 support for Windows touch gestures, one of which is a “rotate” gesture, and I thought that a good demo would be to rotate a CD cover image when the gesture is detected. So I need to be able to rotate the RGB values that make up the image.”

There were a number of helpful responses, mainly about using rotation matrices, but what intrigued me was Jay Foad’s almost throw away comment: “It’s also possible to do a rotation with repeated shears”

That looked interesting, so I had a bit of a Google and found an article on rotation by shearing written by Tobin Fricke. The gist of this article is that shearing a matrix three times (first horizontally, then vertically and then again horizontally) effects a rotation. An interesting aspect of this (as compared to the rotation method) is that no information is lost. Every single pixel is moved and there is no loss of quality of the image.

I’m no mathematician, but I could read my way though enough of that to realise that it would be trivial in APL. The only “real” code needed is to figure how much to shear each line in the matrix.

Converting Fricke’s α=-tan(Θ/2) and ϐ=sin(Θ), we get (using for Θ):

a←-3○⍺÷2
b←1○⍺

and so, for a given matrix, the amounts to shift are given by:

⌊⍺×(⍳≢⍵)-0.5×≢⍵

where is either a or b, depending on which shift we are doing.

Wrapping this in an operator (where ⍺⍺ is either or ):

shear←{
(⌊⍺×(⍳≢⍵)-0.5×≢⍵)⍺⍺ ⍵
}

Our three shears of matrix , of size s, can therefore be written as:

a⌽shear b⊖shear a⌽shear(3×s)↑(-2×s)↑⍵

Note the resize and pad of the matrix before we start so that we don’t replicate the edges of the bitmap. The padding elements are zeros and display as black pixels. They have been removed from the images below for clarity.

The final rotate function is:

rotate←{
shear←{(⌊⍺×(⍳≢⍵)-0.5×≢⍵)⍺⍺ ⍵}
a←-3○⍺÷2 ⋄ b←1○⍺ ⋄ s←⍴⍵
a⌽shear b⊖shear a⌽shear(3×s)↑(-2×s)↑⍵
}

Let’s look at a 45° rotate of a bitmap called “jd”.

jd.CBits←(○.25)rotate jd.CBits

Prior to performing any shears our image is:

rot0

After the first shear it becomes:

rot1

After the second shear it is:

rot2

And after the third shear it is:

rot3

All nicely rotated.

Isolated Mandelbrot Set Explorer

mandelbrot-equation

It amazes me that an equation so simple can produce images so beautiful and complex.  I was first introduced to the Mandelbrot set by the August 1985 issue of Scientific American. Wikipedia says “The Mandelbrot set is the set of complex numbers ‘c’ for which the sequence ( c , c² + c , (c²+c)² + c , ((c²+c)²+c)² + c , (((c²+c)²+c)²+c)² + c , …) does not approach infinity.” This means that in the images below, the areas in black are points in the Mandelbrot set and the other point colors are determined by how quickly that point “zooms” off towards infinity. The links above give a deeper discussion on the Mandelbrot set – for this post I’d like to focus on how I was able to speed my Mandelbrot set explorer up using isolates.

Mandelbrot Set
I find I learn things more easily through application than abstraction. In 2006 I wanted to learn more about the built-in GUI features of Dyalog for Windows and the application I chose was a Mandelbrot set explorer. While the basic algorithm for calculating a point in the Mandelbrot set is fairly simple, a 640×480 pixel image could involve over 75 million calculations. As new versions of Dyalog are released, I try to experiment with any applicable new features and possibly incorporate them into the explorer. For instance, when complex numbers were introduced in Dyalog, I modified the explorer to use them.

Dyalog v14.0 introduced a prototype of a parallel computing feature called isolates that may one day be integrated into the interpreter. Isolates behave much like namespaces except that they run in their own process with their own copy of the interpreter. Each process can run on a separate processor, greatly improving performance. You can even set up isolate servers on other machines.

Mandelbrot set calculations are ideal candidates for parallelization as:

  • they’re CPU intensive
  • each pixel can be calculated independently of others
  • there are no side effects and no state to maintain

The core calculation is shown below. It takes the real and imaginary parameters for the top left corner, the height and width of the image and the increment along each dimension and uses those parameters to build a set of co-ordinates, then iterates to see how quickly each co-ordinate “escapes” the Mandelbrot set.

 ∇ r←real buildset_core imaginary;height;i_incr;top;width;r_incr;left;set;coord;inds;cnt;escaped
  ⍝ Calculate new Mandelbrot set - remember using origin 0 (⎕IO←0)
  ⍝ real and imaginary are the parameters of the space to calculate
  ⍝       real            imaginary
  ⍝ [0]   width           height            of the image
  ⍝ [1]   increment size  increment size    along each axis
  ⍝ [2]   top             left              coordinates

   (height i_incr top)←imaginary            ⍝ split imaginary argument parameters
   (width r_incr left)←real                 ⍝ split real argument parameters
   set←coord←,(0J1×top+i_incr×⍳height)∘.+(left+r_incr×⍳width) ⍝ create all points in the image
   inds←⍳≢r←(≢coord)⍴0                      ⍝ initialize the result and create vector of indices
   :For cnt :In 1↓⍳256                      ⍝ loop up to 255 times (the size of our color palette)
       escaped←4<coord×+coord               ⍝ mark those that have escaped the Mandelbrot set
       r[escaped/inds]←cnt                  ⍝ set their index in the color palette
       (inds coord)←(~escaped)∘/¨inds coord ⍝ keep those that have not yet escaped
       :If 0∊⍴inds ⋄ :Leave ⋄ :EndIf        ⍝ anything left to do?
       coord←set[inds]+×⍨coord              ⍝ the core Mandelbrot computation... z←(z*2)+c
   :EndFor
   r←(height,width)⍴r                       ⍝ reshape to image size
 ∇

So, what’s involved in using isolates to parallelize the explorer?  As it turns out, not much! First I )COPYed the isolate namespace from the isolate workspace distributed with Dyalog v14.0. I wanted to be able to select the number of processors to use, so I added a line to my init function to query how many processors are available:

 processors←1111⌶⍬ ⍝ query number of processors

Next I wrote a small function to initialize isolates for as many processors as were selected (2 or more). Notice that when creating the isolates, I copy buildset_core (from above) into each one:

 ∇ init_isolates processors
  ⍝ called upon initialization, or whenever user changes number of processors to use
   f.isomsg.Visible←1    ⍝ turn form's "Isolates initializing" message on
   {}isolate.Reset''     ⍝ reset the isolates
   {}isolate.Config'processors'processors ⍝ set the number of processors
   :If processors>1      ⍝ if using more than one processor
       ⍝ create the isolates, populating each one with buildset_core
       isolates←isolate.New¨processors⍴⎕NS'buildset_core'
   :EndIf
   f.isomsg.Visible←0    ⍝ turn the form message off
 ∇

Finally, I wrote a routine to share the work between the selected number of processors and assemble the results back together.

 ∇ r←size buildset rect;top;bottom;left;right;width;height;vert;horz;rows;incr;imaginary;real
  ⍝ Build new Mandelbrot Set using isolates if specified
  ⍝ rect is the top, bottom, left, and right
  ⍝ size is the height and width of the image
   (top bottom left right)←rect
   (height width)←size
   vert←bottom-top                             ⍝ vertical dimension
   horz←right-left                             ⍝ horizontal dimension
   rows←processors{¯2-/⌈⍵,⍨(⍳⍺)×⍵÷⍺}height     ⍝ divide image among processors (height=+/rows)
   incr←vert÷height                            ⍝ vertical increment
   imaginary←↓rows,incr,⍪top+incrׯ1↓0,+\rows  ⍝ imaginary arguments to buildset_core
   real←width,(horz÷width),left                ⍝ real arguments to buildset_core
   :If processors=1                            ⍝ if not using isolates
       r←real buildset_core⊃imaginary          ⍝ build the image here
   :Else
       r←⊃⍪/(⊂real)isolates.buildset_core imaginary ⍝ otherwise let the isolates to it
   :EndIf
 ∇

The images below depicts what happens with 4 processors. Each processor builds a slice of the image and the main task assembles them together.
Drawing1

So, how much faster is it?  Did my 8 processor Intel Core i7 speed things up by a factor of 8? No, but depending on where in the Mandelbrot set you explore, the speed is generally increased by at least a factor of 2 and sometimes as high as 5 or 6. I thought about why this might be and have a couple of ideas…

  • There’s overhead to receive and assemble the results from each isolate.
  • The overall time will be the maximum of the individual times; if one slice is in an area that requires more iterations, then the main task needs to wait for those results, even if the other slices have already finished. I could possibly show this by updating the individual slices as they become available. Maybe next time 🙂

What impressed me was that I could speed up my application significantly with about 20 lines of code and maybe an hour’s effort – most of which was spent tinkering to learn isolates. You can download the Mandelbrot set explorer from my GitHub repository.

Dancing with the Bots

Last week the ‘bots were busy preparing for the J Language Conference in Toronto, where they made their first public appearance together. Upon returning to Bramley they continued training and we are proud to present the first recording of their new dance:

The ‘bots are both running the same DyaBot class as last year. This class exposes a property called Speed, which is a 2-element vector representing the speed of the right and left wheels respectively. Valid values range from +100 (full speed ahead) to -100 (full reverse). The annotations displayed at the top left show the settings used for each step of the dance.

Controlling Two Robots at Once using Isolates

Isolates are a new feature included with Dyalog version 14.0, designed to make it easy to perform distributed processing. In addition to making it easy to used all the cores on your own laptop or workstation, isolates make it possible to harness the power of other machines. This requires the launching of an “isolate server” on each machine that wants to offer its services:

Starting an isolate server using PuTTY to run Dyalog on the robot.

Starting an isolate server on DyaBot00 using PuTTY.

Once we have an isolate server running on each robot we can take control of them from a remote session as follows:

      )load isolate
      #.isolate.AddServer 'dyabot00' (7052)      
      #.isolate.AddServer 'dyabot04' (7052)
      bots←isolate.New¨Bot Bot
      bots.Init
 dyabot00  dyabot04

Above, we create two instances of the Bot namespace. The expression Bots.Init invokes the Init function, which returns the hostname, in each isolate:

:Namespace Bot

    ∇ r←Init;pwd
      pwd←∊⎕SH'pwd' ⍝ Find out where to copy from
      #.⎕CY botws←pwd,'/DyaBot/DyaBot.dws' ⍝ copy ws
      i←⎕NEW #.DyaBot ⍬ ⍝ Make DyaBot instance    
      r←⎕SH'hostname' ⍝ Return hostname
    ∇

:EndNamespace

Next, we define a function “run” that will take a vector of dance steps as input. Each step is a character vector (because that makes editing slightly easier!) containing five numbers: The first two set the speed of one robot, the next two the speed of the other and the fifth defines the duration of the step. After each step we pause for a second, to give humans time to appreciate the spectacle:


    ∇ run cmds;data;i;cmd;z
[1]    ⎕DL 5
[2]    :For i :In ⍳≢cmds
[3]        :If ' '∨.≠cmd←i⊃cmds
[4]            data←1 0 1 0 1⊂2⊃⎕VFI cmd ⍝ Cut into 3 numeric pieces
[5]            z←bots.{i.Speed←⍵}2↑data ⋄ ⎕DL⊃¯1↑data ⋄ z←bots.(i.Speed←0)
[6]            ⎕DL 1
[7]        :EndIf
[8]    :EndFor
    ∇

Now we are ready to roll: Call the run function with a suitable array and watch the robots dance (see the video at the top)!

      ↑choreography
50  50  0   0 1.5
 0   0 50  60 1.2
50 ¯50 50 ¯50 0.3
20  80 10  70 5  
50 ¯50 50 ¯50 0.3
50  50  0   0 1.5
 0   0 50  60 1.2

      dance choreography

Join us again next week to hear what happened when Romilly came to Bramley to help wire up the accelerometer and gyro!