Case Conversion: Mapping and Folding

Dyalog v18.0 introduced ⎕C, which converts the case of characters in an array by mapping to lower case, mapping to upper case, or folding. This superseded an earlier experimental I-beam (819⌶) that could map to lower or upper case but not fold – this I-beam is currently deprecated and will be removed in Dyalog v20.0.

There’s often confusion about the difference between mapping and folding. Mapping is used when you want characters in an array to be in a particular case, whereas folding is used when you want to eliminate case to perform case-insensitive comparisons. The confusion arises because case insensitive comparisons can usually be done well by mapping to lower case rather than folding – indeed, on first inspection, mapping to lower case and folding appear to be the same thing.

The difference is best illustrated through some examples. Used monadically, the I-beam maps to lower case and ⎕C folds:

      819⌶'Hello'
hello
      ⎕C'Hello'
hello

A case-insensitive match function that uses 819⌶ to map both arguments to lower case before checking if they match might look like this:

      cc←≡⍥(819⌶)

In many cases it will appear to work without issues:

      'hello' cc 'HELLO'
1
      'hello' cc 'GOODBYE'
0

However, it doesn’t always work. Greek, for example, has two different lower-case sigma characters (σ and ς) but only one upper case (Σ). ίσως and ΊΣΩΣ are case-insensitively equal, but the function does not work:

      'ίσως' cc 'ΊΣΩΣ'
0

This is because when these two arrays are mapped to lower case they become ίσως and ίσωσ respectively, which do not match. If we use folding instead:

      cc←≡⍥⎕C

the comparisons work as expected:

      'hello' cc 'HELLO'
1
      'hello' cc 'GOODBYE'
0
      'ίσως' cc 'ΊΣΩΣ'
1

This works because folding converts the Greek words to ίσωσ and ίσωσ respectively – every different sigma character, even the lower-case ones, have been changed to σ, and now the two arrays match.

The main use of case conversion is to perform caseless comparisons, so you might wonder why mapping to upper and lower case is supported at all. There are still occasions where you might need that – most notably, when formatting text for display.

If you are still using 819⌶ to perform caseless comparison you should change to using ⎕C to get correct behaviour. And do not forget that 819⌶ will not be supported beyond Dyalog v19.0.

Want to learn more? Adám Brudzewsky explains mapping and folding in this webinar, beginning at 00:06:33.

I-Beam Mnemonics

I-Beam () is an operator that takes as its operand a numeric code and derives a function which isn’t really considered to be part of the APL language – for example: something which could be experimental, which might provide access to parts of the interpreter that should only be accessed with care, or may set up specific conditions within the interpreter to help in testing. Principally, I-Beam functions exist for internal use within Dyalog but as of version 15.0 there are 55 I-Beam functions that are published and available for general use. How on earth are you supposed to remember all the different codes?

To an extent, you are not. It is perhaps no bad thing that they are a little difficult to remember: I-Beam functions are experimental, liable to be changed or even removed and should be used with care. The codes appear to be somewhat random partly because they are somewhat random – mostly selected on the whim of the developer who implemented the function. However, some codes were chosen less randomly than others – quite a few are grouped so that I-Beams that provide related functionality appear consecutively or at least close together but the most interesting ones are “named” – or, at least, given a numeric code that is derived from a name or otherwise memorable value. So here are some of those unofficial mnemonics that may help you better remember them too.

A favourite trick for deriving a code from a meaningful name is to devise a name containing the letters I, V, X, L, C, D and M, and then convert that into a number as if it were a Roman numeral. Thus:

  • “Inverted table IndeX of” is abbreviated to IIX, which is I-Beam 8
  • “Syntax Colouring” rather awkwardly becomes “Cyntax Colouring”, thence CC and I-Beam 200
  • “Called Monadically” is CM, or 900
  • “Memory Manager” is MM – I-Beam 2000
  • “Line Count” is LC, or rather L,C – 50100

Another favourite is to use numbers that look or sound like something else:

  • The compression I-Beam is 219, which looks a bit like “ZIP”
  • The case folding I-Beam is 819, or “BIg”
  • When support for function trains was in development an I-Beam was used to switch between different implementations – 1060, or “lOCO[motive]” (this I-Beam is no longer in use)
  • The number of parallel threads I-Beam is 1111 – four parallel lines
  • The fork I-Beam is 4000: 4000 is 4K; “Four K” said very quickly might sound like “Fork”

For others the function is associated, or can be associated, with something already numeric:

  • The I-Beam to update function timestamps is 1159 – a memorable time
  • The I-Beam to delete the content in unused pockets is 127 – the same as the ASCII code for the delete character
  • The draft JSON standard is RFC 7159 and support for it is implemented as 7159⌶
  • The now-deprecated I-Beam to switch between Random Number Generator algorithms is 16807 – the default value used to derive the random seed in a clear workspace

Tips for the use of I-Beam functions

  • Do not use I-Beam functions directly in application code – always encapsulate their use in cover-functions that can quickly be modified to protect your application in the event that Dyalog should change the behaviour of, or withdraw, an I-Beam function.
  • Do not use I-Beam functions that are not documented by Dyalog in the Language Reference or elsewhere – those designed to test the interpreter may have undesirable side effects.
  • Do let Dyalog know if you find an I-Beam function particularly useful and/or have suggestions for its development. Some new features are initially implemented as I-Beam functions to allow such feedback to shape their final design.

Zero-length Regular Expression Matches Considered Harmful

I was asked by a colleague why ⎕S reports two matches in the following example:

      ('\d*'⎕S 0 1)'321'
┌───┬───┐
│0 3│3 0│
└───┴───┘

Here we are asking for the position and length of sequences of zero or more digits in an input document containing three numeric characters. Intuitively there is just one match of all three characters from the start of the sequence, but ⎕S reports an additional match of zero characters at the end.

The short explanation is this: when there is a match in the input, the matching characters are “consumed” and searching continues from after them. In this example the first match consumes all three characters in the input, leaving it empty. Searching then resumes and, as the pattern matches the zero remaining characters, there is a second match.

If it seems odd that we search when the input is empty then consider the case where the input is empty to start with:

      ('\d*'⎕S 0 1)''
┌───┐
│0 0│
└───┘

In this case we should not be surprised there is a match – the pattern explicitly allows it and, indeed, we would have coded the search pattern differently, perhaps as '\d+', if that was not wanted. It is therefore correct and consistent that the first example matches zero characters as well.

My colleague was not yet convinced. In this case, he asked, why are there not an infinite number of matches at the end?

It’s a good question. Zero-length matches have to be treated specially by the search engine to prevent exactly that, wherever they appear. The rule used by the PCRE search engine, which Dyalog uses, is that any pattern that results in a zero-length match is prohibited from generating another zero-length match until more characters are consumed from the input. PCRE is widely used so this behaviour is commonplace, but other search engine implementations take a different view and simply consume a single character following a zero-length match to prevent the repetition.

That difference in the rules would not have affected our example but consider a slightly more complex search pattern containing alternatives – here, zero or more digits, or one word character:

      ('\d*|\w'⎕S 0 1)'x321'
┌───┬───┬───┬───┐
│0 0│0 1│1 3│4 0│
└───┴───┴───┴───┘

In this example the first match is of zero digits at the start of the input – that is, at offset zero and zero characters long. PCRE then resumes searching at offset zero but without allowing further zero-length matches. Consequently the first alternative pattern cannot match this time but the second can. The next match is therefore a single word character (the 'x'), also at the start of the input. The 'x' is consumed and searching resumes with '321' as the input where, as explained above, there are two further matches making four in total.

Search engines which simply consume a character after a zero-length match would give a different result. After the same initial match of zero characters at the start of the input they would consume the 'x' and resume searching at '321' to give two further matches and a total of only three.

So, patterns which allow zero-length matches appear to be counter-intuitive and the cause of incompatibilities with other search engines. Should we avoid '*' and other qualifiers which allow zero repetitions of a pattern? Of course not – used carefully they can be very effective.

Firstly, a pattern which allows a sequence of zero characters within it can still be constructed so that the overall match can never have a length of zero. For example, 'colou?r' allows zero or one 'u' characters after the second 'o' so that it matches both the British English spelling 'colour' and the American English spelling 'color' – but it never results in a zero-length match, and should never surprise anyone in the way it works.

Secondly, anchors can often be used to prevent those unintuitive extra matches. Anchors in search patterns don’t match actual characters – rather, they allow the match to succeed only at specific locations in the input. '^' and '$' mean the start and end of a line respectively which, used in our very first example, prevent anything from preceding or following a match within a line and exclude the zero-length sequence because it does not satisfy that requirement:

      ('^\d*$'⎕S 0 1)'321'
┌───┐
│0 3│
└───┘

Thirdly – it often doesn’t matter that zero-length matches occur. In this example the search pattern matches any character except newline zero or more times and we are asking for it to be folded to upper-case:

      ('.*' ⎕R '\u&') 'Hello there'
HELLO THERE

The first match is of all 11 characters which are folded to upper-case and replaced. Then is there a second match of zero characters at the end and this too is folded and replaced, having no further effect. You can see this is so by using a different replacement pattern:

      ('.*' ⎕R '[\u&]') 'Hello there'
[HELLO THERE][]

As an aside, a ⎕R search pattern constructed so that it has no visible effect at all can be very useful for doing a quick and simple read of a text file into the workspace. In this example:

      tieno←⎕NTIE 'input.txt'
      ''⎕R''⊢tieno

text is read from the file without modification into a vector of character vectors, one line per element. There are many search and replace patterns that could be used to do this but this is the shortest, and it makes use of zero-length matches and replacements.

Perhaps, after all, the title of this blog should have been “zero-length regex matches considered helpful“?

Further Reading

Further information about the Dyalog search and replace operators is available in the following documents:

Footnote

Edsger W. Dijkstra wrote a paper in 1968 for Communications of the ACM entitled a “Case against the GO TO Statement” but it was published as a letter under the title “The goto statement considered harmful”. “Considered harmful” became a popular phrase in articles written in response to this and in other unrelated papers. There is an article about the phrase in Wikipedia.

Dijkstra also wrote a 1975 paper in which he criticised a number of computer languages he had used. Of APL he said, “APL is a mistake, carried through to perfection. It is the language of the future for the programming techniques of the past: it creates a new generation of coding bums”.

This author is not afraid to use goto statements or APL.