Kawa: Characters and text

Characters

Characters are objects that represent human-readable characters such as letters and digits. More precisely, a character represents a Unicode scalar value. Each character has an integer value in the range 0 to #x10FFFF (excluding the range #xD800 to #xDFFF used for Surrogate Code Points).

Note: Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that’s sensitive to surrounding characters). Furthermore, different sequences of scalar values sometimes correspond to the same character. The relationships among scalar values, characters, and glyphs are subtle and complex.

Despite this complexity, most things that a literate human would call a “character” can be represented by a single Unicode scalar value (although several sequences of Unicode scalar values may represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category.

Unicode scalar values exclude the range #xD800 to #xDFFF, which are part of the range of Unicode code points. However, the Unicode code points in this range, the so-called surrogates, are an artifact of the UTF-16 encoding, and can only appear in specific Unicode encodings, and even then only in pairs that encode scalar values. Consequently, all characters represent code points, but the surrogate code points do not have representations as characters.

character

A Unicode code point - normally a Unicode scalar value, but could be a surrogate. This is implemented using a 32-bit int. When an object is needed (i.e. the boxed representation), it is implemented as an instance of gnu.text.Char.

character-or-eof

A character or the special #!eof value (used to indicate end-of-file when reading from a port). This is implemented using a 32-bit int, where the value -1 indicates end-of-file. When an object is needed, it is implemented as an instance of gnu.text.Char or the special #!eof object.

char

A UTF-16 code unit. Same as Java primitive char type. Considered to be a sub-type of character. When an object is needed, it is implemented as an instance of java.lang.Character. Note the unfortunate inconsistency (for historical reasons) of char boxed as Character vs character boxed as Char.

Characters are written using the notation #\character (which stands for the given character); #\xhex-scalar-value (the character whose scalar value is the given hex integer); or #\character-name (a character with a given name):

character ::= #\any-character
        | #\ character-name
        | #\x hex-scalar-value
        | #\X hex-scalar-value

The following character-name forms are recognized:

#\alarm: #\x0007 - the alarm (bell) character
#\backspace: #\x0008
#\delete
#\del
#\rubout: #\x007f - the delete or rubout character
#\escape
#\esc: #\x001b
#\newline
#\linefeed: #\x000a - the linefeed character
#\null
#\nul: #\x0000 - the null character
#\page: #\x000c - the formfeed character
#\return: #\x000d - the carriage return character
#\space: #\x0020 - the preferred way to write a space
#\tab: #\x0009 - the tab character
#\vtab: #\x000b - the vertical tabulation character
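
As a brief illustration of the three notations, the following sketch checks that different spellings denote the same character. It uses only char=? and char->integer, and the expected values are taken from the name list above.

```scheme
;; The same character written as a literal, as a hex scalar value, and by name.
(char=? #\A #\x41)            ; ⇒ #t
(char=? #\space #\x20)        ; ⇒ #t
(char=? #\null #\nul #\x0)    ; ⇒ #t

;; Named characters are ordinary characters with ordinary scalar values.
(char->integer #\tab)         ; ⇒ 9
(char->integer #\escape)      ; ⇒ 27
```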

char? obj

Return #t if obj is a character, #f otherwise. (The obj can be any character, not just a 16-bit char.)

char->integer char

integer->char sv

sv should be a Unicode scalar value, i.e., a non–negative exact integer object in [0, #xD7FF] union [#xE000, #x10FFFF]. (Kawa also allows values in the surrogate range.)

Given a character, char->integer returns its Unicode scalar value as an exact integer object. For a Unicode scalar value sv, integer->char returns its associated character.
(integer->char 32)                     ⇒ #\space
(char->integer (integer->char 5000))   ⇒ 5000
(integer->char #\xD800)                ⇒ throws ClassCastException
Performance note: A call to char->integer is compiled as casting the argument to a character, and then re-interpreting that value as an int. A call to integer->char is compiled as casting the argument to an int, and then re-interpreting that value as a character. If the argument is the right type, no code is emitted: the value is just re-interpreted as the result type.
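
A minimal sketch of how the two conversions combine. The helper char-shift is a hypothetical name introduced only for this illustration.

```scheme
;; Hypothetical helper: the character n code points after c.
(define (char-shift c n)
  (integer->char (+ (char->integer c) n)))

(char-shift #\a 1)                        ; ⇒ #\b
(char-shift #\A 25)                       ; ⇒ #\Z

;; Scalar values above #xFFFF work the same way.
(char->integer (integer->char #x1F600))   ; ⇒ 128512
```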

char=? char_1 char_2 char_3 …

char<? char_1 char_2 char_3 …

char>? char_1 char_2 char_3 …

char<=? char_1 char_2 char_3 …

char>=? char_1 char_2 char_3 …

These procedures impose a total ordering on the set of characters according to their Unicode scalar values.
(char<? #\z #\ß)      ⇒ #t
(char<? #\z #\Z)      ⇒ #f
Performance note: This is compiled as if converting each argument using char->integer (which requires no code) and then using the corresponding int comparison.
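
Because the comparison procedures accept more than two arguments, a range test can be written as a single call. The predicate name below is an assumption made for this sketch.

```scheme
;; Hypothetical predicate: #t if c lies between #\a and #\z inclusive.
(define (ascii-lower-case? c)
  (char<=? #\a c #\z))

(ascii-lower-case? #\q)   ; ⇒ #t
(ascii-lower-case? #\Q)   ; ⇒ #f
(ascii-lower-case? #\ß)   ; ⇒ #f, since ß (#xDF) sorts after z (#x7A)
```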

digit-value char

This procedure returns the numeric value (0 to 9) of its argument if it is a numeric digit (that is, if char-numeric? returns #t), or #f on any other character.
(digit-value #\3)        ⇒ 3
(digit-value #\x0664)    ⇒ 4
(digit-value #\x0AE6)    ⇒ 0
(digit-value #\x0EA6)    ⇒ #f
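
One possible use of digit-value is converting a run of digits to a number. The following is only a sketch; digits->number is a hypothetical helper, not part of Kawa.

```scheme
;; Hypothetical helper: convert a string of decimal digits (in any script)
;; to an exact integer, or return #f if some character is not a digit.
;; Note that a digit value of 0 still counts as true in the cond test.
(define (digits->number str)
  (let loop ((chars (string->list str)) (acc 0))
    (cond ((null? chars) acc)
          ((digit-value (car chars))
           => (lambda (d) (loop (cdr chars) (+ (* 10 acc) d))))
          (else #f))))

(digits->number "2024")   ; ⇒ 2024
(digits->number "007")    ; ⇒ 7
(digits->number "12a")    ; ⇒ #f
```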

Character sets

Sets of characters are useful for text-processing code, including parsing, lexing, and pattern-matching. SRFI 14 specifies a char-set type for such uses. Some examples:

(import (srfi :14 char-sets))
(define vowel (char-set #\a #\e #\i #\o #\u))
(define vowely (char-set-adjoin vowel #\y))
(char-set-contains? vowel #\y) ⇒  #f
(char-set-contains? vowely #\y) ⇒  #t

See the SRFI 14 specification for details.

char-set

The type of character sets. In Kawa char-set is a type that can be used in type specifiers:
(define vowely ::char-set (char-set-adjoin vowel #\y))

Kawa uses inversion lists for an efficient implementation, using Java int arrays to represent character ranges (inversions). The char-set-contains? function uses binary search, so it takes time proportional to the logarithm of the number of inversions. Other operations may take time proportional to the number of inversions.
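
As a sketch of a typical text-processing use, the following counts the characters of a string that belong to a set. The name count-in-set is an assumption made for this example.

```scheme
(import (srfi :14 char-sets))

(define vowels (char-set #\a #\e #\i #\o #\u))

;; Hypothetical helper: count the characters of str that are members of cs.
(define (count-in-set cs str)
  (let ((n 0))
    (string-for-each
     (lambda (ch)
       (if (char-set-contains? cs ch)
           (set! n (+ n 1))))
     str)
    n))

(count-in-set vowels "education")   ; ⇒ 5
```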

Strings

Strings are sequences of characters. The length of a string is the number of characters that it contains, as an exact non-negative integer. This number is usually fixed when the string is created; however, you can extend a mutable string with the (Kawa-specific) string-append! function. The valid indices of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.

Strings are implemented as a sequence of 16-bit char values, even though they're semantically a sequence of 32-bit Unicode code points. A character whose value is greater than #xffff is represented using two surrogate characters. The implementation allows for natural interoperability with Java APIs. However it does make certain operations (indexing or counting based on character counts) difficult to implement efficiently. Luckily one rarely needs to index or count based on character counts; alternatives are discussed below.

Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with “-ci” (for “case insensitive”).

string

The type of string objects. The underlying type is the interface java.lang.CharSequence. Immutable strings are java.lang.String, while mutable strings are gnu.lists.FString.

Basic string procedures

string? obj

Return #t if obj is a string, #f otherwise.

string char …

Return a newly allocated string composed of the arguments. This is analogous to list.

make-string k

make-string k char

Return a newly allocated string of length k. If char is given, then all elements of the string are initialized to char, otherwise the contents of the string are unspecified.

string-length string

Return the number of characters in the given string as an exact integer object.

Performance note: Calling string-length may take time proportional to the length of the string, because of the need to scan for surrogate pairs.

string-ref string k

k must be a valid index of string. The string-ref procedure returns character k of string using zero–origin indexing.

Performance note: Calling string-ref may take time proportional to k because of the need to check for surrogate pairs. An alternative is to use string-cursor-ref. If iterating through a string, use string-for-each.
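
The difference between character counts and 16-bit code units can be seen with a string that contains a character above #xFFFF. This sketch assumes that the literal is a java.lang.String, so that the Java length method can be called with invoke.

```scheme
;; "a", then U+1F600 (stored as a surrogate pair), then "b".
(define s "a\x1F600;b")

(string-length s)     ; ⇒ 3, counting characters (code points)
(invoke s 'length)    ; ⇒ 4, counting UTF-16 code units (assumes a Java String)

;; Indices passed to string-ref count characters, not code units.
(char->integer (string-ref s 1))   ; ⇒ 128512, that is #x1F600
(string-ref s 2)                   ; ⇒ #\b
```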

string-set! string k char

This procedure stores char in element k of string.
(define s1 (make-string 3 #\*))
(define s2 "***")
(string-set! s1 0 #\?) ⇒ void
s1 ⇒ "?**"
(string-set! s2 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error
Performance note: Calling string-set! may take time proportional to the length of the string: First it must scan for the right position, like string-ref does. Then if the new character requires using a surrogate pair (and the old one doesn't) then we have to make room in the string, possibly re-allocating a new char array. Alternatively, if the old character requires using a surrogate pair (and the new one doesn't) then following characters need to be moved.

The function string-set! is deprecated: It is inefficient, and it very seldom does the correct thing. Instead, you can construct a string with string-append!.

substring string start end

string must be a string, and start and end must be exact integer objects satisfying:
0 <= start <= end <= (string-length string)
The substring procedure returns a newly allocated string formed from the characters of string beginning with index start (inclusive) and ending with index end (exclusive).
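
A short illustration of the inclusive start and exclusive end:

```scheme
(define s "hello world")

(substring s 0 5)                   ; ⇒ "hello"
(substring s 6 (string-length s))   ; ⇒ "world"
(substring s 3 3)                   ; ⇒ "" (start = end gives an empty string)
```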

string-append string …

Return a newly allocated string whose characters form the concatenation of the given strings.

string-append! string value …

The string must be a mutable string, such as one returned by make-string or string-copy. The string-append! procedure extends string by appending each value (in order) to the end of string.

Performance note: The compiler converts a call with multiple values to multiple string-append! calls. If a value is known to be a character, then no boxing (object-allocation) is needed.

The following example shows how to efficiently process a string using string-for-each and incrementally “building” a result string using string-append!.
(define (translate-space-to-newline str::string)::string
  (let ((result (make-string 0)))
    (string-for-each
     (lambda (ch)
       (string-append! result
                       (if (char=? ch #\space) #\newline ch)))
     str)
    result))
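
The same building pattern works when the appended values are a mix of strings and characters. The helper join-strings below is a hypothetical name used only for this sketch.

```scheme
;; Hypothetical helper: join a list of strings, separated by the character sep.
(define (join-strings items sep)
  (let ((result (make-string 0))
        (first? #t))
    (for-each
     (lambda (item)
       (if first?
           (set! first? #f)
           (string-append! result sep))   ; sep is a character
       (string-append! result item))      ; item is a string
     items)
    result))

(join-strings '("a" "b" "c") #\,)   ; ⇒ "a,b,c"
```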

string->list string [start [end]]

list->string list

It is an error if any element of list is not a character.

The string->list procedure returns a newly allocated list of the characters of string between start and end. The list->string procedure returns a newly allocated string formed from the characters in list. In both procedures, order is preserved. The string->list and list->string procedures are inverses so far as equal? is concerned.
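
For example, the two procedures can be combined with list operations. The name reverse-string is an assumption made for this sketch.

```scheme
;; Hypothetical helper: reverse the characters of a string by way of a list.
(define (reverse-string str)
  (list->string (reverse (string->list str))))

(reverse-string "stressed")   ; ⇒ "desserts"
(string->list "apple" 1 3)    ; ⇒ (#\p #\p)
```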

string-for-each proc string_1 string_2 …

string-for-each proc string_1 [start [end]]

The strings must all have the same length. proc should accept as many arguments as there are strings.

The start-end variant is provided for compatibility with the SRFI-13 version. (In that case start and end count Unicode scalar values (character values), not Java 16-bit char values.)

The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.

Analogous to for-each.
(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
   v)
  ⇒ (101 100 99 98 97)
Performance note: The compiler generates efficient code for string-for-each. If proc is a lambda expression, it is inlined.
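
With several strings, proc receives one character from each string per step. A minimal sketch, in which hamming-distance is a hypothetical helper:

```scheme
;; Hypothetical helper: count the positions at which two strings of equal
;; length differ.
(define (hamming-distance a b)
  (let ((n 0))
    (string-for-each
     (lambda (x y)
       (if (not (char=? x y))
           (set! n (+ n 1))))
     a b)
    n))

(hamming-distance "karolin" "kathrin")   ; ⇒ 3
```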

string-map proc string_1 string_2 …

The string-map procedure applies proc element-wise to the elements of the strings and returns a string of the results, in order. It is an error if proc does not accept as many arguments as there are strings, or return other than a single character. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
(string-map char-foldcase "AbdEgH")  ⇒ "abdegh"
(string-map
  (lambda (c) (integer->char (+ 1 (char->integer c))))
  "HAL")
        ⇒ "IBM"
(string-map
  (lambda (c k)
    ((if (eqv? k #\u) char-upcase char-downcase) c))
  "studlycaps xxx"
  "ululululul")
        ⇒ "StUdLyCaPs"
Performance note: The string-map procedure has not been optimized (mainly because it is not very useful): The characters are boxed, and the proc is not inlined even if it is a lambda expression.

string-copy string [start [end]]

Returns a newly allocated copy of the part of the given string between start and end.

string-replace! dst dst-start dst-end src [src-start [src-end]]

Replaces the characters of string dst (between dst-start and dst-end) with the characters of src (between src-start and src-end). The number of characters from src may be different than the number replaced in dst, so the string may grow or contract. The special case where dst-start is equal to dst-end corresponds to insertion; the case where src-start is equal to src-end corresponds to deletion. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)
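
The three cases (replacement, insertion, deletion) can be sketched as follows. The results shown are what the description above implies for a mutable string obtained from string-copy.

```scheme
(define s (string-copy "abcdef"))

(string-replace! s 1 3 "XYZ")   ; replace "bc" with "XYZ"; the string grows
s                               ; ⇒ "aXYZdef"

(string-replace! s 0 0 ">> ")   ; dst-start = dst-end: pure insertion
s                               ; ⇒ ">> aXYZdef"

(string-replace! s 0 3 "")      ; empty source range: pure deletion
s                               ; ⇒ "aXYZdef"
```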

string-copy! to at from [start [end]]

Copies the characters of the string from that are between start and end into the string to, starting at index at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)

This is equivalent to (and implemented as):
(string-replace! to at (+ at (- end start)) from start end)
(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2)
b  ⇒  "a12de"

string-fill! string fill [start [end]]

The string-fill! procedure stores fill in the elements of string between start and end. It is an error if fill is not a character or is forbidden in strings.
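
A short example of filling a sub-range and then the whole string:

```scheme
(define s (make-string 5 #\-))

(string-fill! s #\* 1 3)   ; fill indices 1 and 2 only
s                          ; ⇒ "-**--"

(string-fill! s #\.)       ; no range given: fill the whole string
s                          ; ⇒ "....."
```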

String Comparisons

string=? string_1 string_2 string_3 …

Return #t if the strings are the same length and contain the same characters in the same positions. Otherwise, the string=? procedure returns #f.
(string=? "Straße" "Strasse")    ⇒ #f

string<? string_1 string_2 string_3 …

string>? string_1 string_2 string_3 …

string<=? string_1 string_2 string_3 …

string>=? string_1 string_2 string_3 …

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing. These predicates are required to be transitive.

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string<? is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.
(string<? "z" "ß")      ⇒ #t
(string<? "z" "zz")     ⇒ #t
(string<? "z" "Z")      ⇒ #f
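
As a sketch of the ordering in use, the following checks whether a list of strings is sorted. The name strings-sorted? is hypothetical.

```scheme
;; Hypothetical helper: #t if each string is <= its successor.
(define (strings-sorted? strs)
  (or (null? strs)
      (null? (cdr strs))
      (and (string<=? (car strs) (cadr strs))
           (strings-sorted? (cdr strs)))))

(strings-sorted? '("apple" "banana" "cherry"))   ; ⇒ #t
(strings-sorted? '("apple" "Banana"))            ; ⇒ #f, #\B (66) precedes #\a (97)
```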

string-ci=? string_1 string_2 string_3 …

string-ci<? string_1 string_2 string_3 …

string-ci>? string_1 string_2 string_3 …

string-ci<=? string_1 string_2 string_3 …

string-ci>=? string_1 string_2 string_3 …

These procedures are similar to string=?, etc., but behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without -ci.
(string-ci<? "z" "Z")                   ⇒ #f
(string-ci=? "z" "Z")                   ⇒ #t
(string-ci=? "Straße" "Strasse")        ⇒ #t
(string-ci=? "Straße" "STRASSE")        ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ")             ⇒ #t
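
A case-insensitive lookup is one place where these procedures are convenient. This is only a sketch; member-ci is a hypothetical helper modelled on member.

```scheme
;; Hypothetical helper: like member, but compares with string-ci=?.
(define (member-ci str strs)
  (cond ((null? strs) #f)
        ((string-ci=? str (car strs)) strs)
        (else (member-ci str (cdr strs)))))

(member-ci "HTTP" '("ftp" "http" "ssh"))   ; ⇒ ("http" "ssh")
(member-ci "smtp" '("ftp" "http" "ssh"))   ; ⇒ #f
```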

String Conversions

string-upcase string

string-downcase string

string-titlecase string

string-foldcase string

These procedures take a string argument and return a string result. They are defined in terms of Unicode's locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word, and downcases all other cased characters.
(string-upcase "Hi")              ⇒ "HI"
(string-downcase "Hi")            ⇒ "hi"
(string-foldcase "Hi")            ⇒ "hi"

(string-upcase "Straße")          ⇒ "STRASSE"
(string-downcase "Straße")        ⇒ "straße"
(string-foldcase "Straße")        ⇒ "strasse"
(string-downcase "STRASSE")       ⇒ "strasse"

(string-downcase "Σ")             ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ")            ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")          ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")         ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")        ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ")         ⇒ "χαοσσ"
(string-upcase "χαος")            ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")            ⇒ "ΧΑΟΣ"

(string-titlecase "kNock KNoCK")  ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs")         ⇒ "R6rs"
(string-titlecase "R6RS")         ⇒ "R6rs"
Note: The case mappings needed for implementing these procedures can be extracted from UnicodeData.txt, SpecialCasing.txt, WordBreakProperty.txt (the “MidLetter” property partly defines case–ignorable characters), and CaseFolding.txt from the Unicode Consortium.

Since these procedures are locale–independent, they may not be appropriate for some locales.

Note: Word breaking, as needed for the correct casing of the upper case Greek sigma and for string-titlecase, is specified in Unicode Standard Annex #29.

Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".

string-normalize-nfd string

string-normalize-nfkd string

string-normalize-nfc string

string-normalize-nfkc string

These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.
(string-normalize-nfd "\xE9;")          ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;")          ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;")    ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;")    ⇒ "\xE9;"
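
Normalization is commonly applied before comparison, because string=? compares scalar values one by one. A minimal sketch, in which canonically-equal? is a hypothetical name:

```scheme
;; Hypothetical helper: compare two strings after normalizing both to NFC.
(define (canonically-equal? a b)
  (string=? (string-normalize-nfc a) (string-normalize-nfc b)))

(string=? "\xE9;" "\x65;\x301;")             ; ⇒ #f, different scalar values
(canonically-equal? "\xE9;" "\x65;\x301;")   ; ⇒ #t, both normalize to "\xE9;"
```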

String Cursor API

Indexing into a string (using for example string-ref) is inefficient because of the possible presence of surrogate pairs. Hence, given an index i, access normally requires linearly scanning the string until we have seen i characters.

The string-cursor API is defined in terms of abstract “cursor values”, which point to a position in the string. This avoids the linear scan.

The API is non-standard, but is based on that in Chibi Scheme.

string-cursor

An abstract position (index) in a string. Implemented as a primitive int which counts the number of preceding code units (16-bit char values).

string-cursor-start str

Returns a cursor for the start of the string. The result is always 0, cast to a string-cursor.

string-cursor-end str

Returns a cursor for the end of the string - one past the last valid character. Implemented as (as string-cursor (invoke str 'length)).

string-cursor-ref str cursor

Return the character at the cursor.

string-cursor-next string cursor [count]

Return the cursor position count (default 1) character positions forwards beyond cursor. For each count this may add either 1 or 2 (if pointing at a surrogate pair) to the cursor.

string-cursor-prev string cursor [count]

Return the cursor position count (default 1) character positions backwards before cursor.

substring-cursor string [start [end]]

Create a substring of the section of string between the cursors start and end.

string-cursor<? cursor1 cursor2

string-cursor<=? cursor1 cursor2

string-cursor=? cursor1 cursor2

string-cursor>=? cursor1 cursor2

string-cursor>? cursor1 cursor2

Return #t if the position of cursor1 is, respectively, before, before or the same as, the same as, after or the same as, or after the position of cursor2.

Performance note: Implemented as the corresponding int comparison.

string-cursor-for-each proc string [start [end]]

Apply the procedure proc to each character position in string between the cursors start and end.
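
The cursor procedures can be combined into a loop that visits each character once without rescanning. The following is a sketch that assumes the cursor procedures are visible in the current environment; count-char is a hypothetical helper.

```scheme
;; Hypothetical helper: count occurrences of ch in str by stepping a cursor
;; from the start of the string to its end.
(define (count-char str ch)
  (let ((end (string-cursor-end str)))
    (let loop ((cur (string-cursor-start str)) (n 0))
      (if (string-cursor<? cur end)
          (loop (string-cursor-next str cur)
                (if (char=? (string-cursor-ref str cur) ch) (+ n 1) n))
          n))))

(count-char "banana" #\a)   ; should give 3
```

Each step advances by one character, which may be one or two code units, so the loop stays linear in the length of the string.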

String literals

Kawa supports two syntaxes of string literals: the traditional, portable, double-quote-delimited literals like "this"; and extended SRFI-109 quasi-literals like &{this}.

Simple string literals

string ::= "string-element^*"
string-element ::= any character other than " or \
    | mnemonic-escape | \" | \\
    | \intraline-whitespace^*line-ending intraline-whitespace^*
    | inline-hex-escape
mnemonic-escape ::= \a | \b | \t | \n | \r | ... (see below)

A string is written as a sequence of characters enclosed within quotation marks ("). Within a string literal, various escape sequences represent characters other than themselves. Escape sequences always start with a backslash (\):

\a: Alarm (bell), #\x0007.
\b: Backspace, #\x0008.
\e: Escape, #\x001B.
\f: Form feed, #\x000C.
\n: Linefeed (newline), #\x000A.
\r: Return, #\x000D.
\t: Character tabulation, #\x0009.
\v: Vertical tab, #\x000B.
\C-x
\^x: Returns the scalar value of x masked (anded) with #x9F. An alternative way to write the ASCII control characters: for example "\C-m" or "\^m" is the same as "\x000D;" (which is the same as "\r"). As a special case \^? is rubout (delete) (\x7f;).
\x hex-scalar-value;
\X hex-scalar-value;: A hex encoding that gives the scalar value of a character.
\ oct-digit^+: At most three octal digits that give the scalar value of a character. (Historical, for C compatibility.)
\u hex-digit^+: Exactly four hex digits that give the scalar value of a character. (Historical, for Java compatibility.)
\M-x: (Historical, for Emacs Lisp.) Set the meta-bit (high-bit of single byte) of the following character x.
\|: Vertical line, #\x007c. (Not useful for string literals, but useful for symbols.)
\": Double quote, #\x0022.
\\: Backslash, #\x005C.
\intraline-whitespace^*line-ending intraline-whitespace^*: Nothing (ignored). Allows you to split up a long string over multiple lines; ignoring initial whitespace on the continuation lines allows you to indent them.

Except for a line ending, any character outside of an escape sequence stands for itself in the string literal. A line ending which is preceded by \intraline-whitespace^* expands to nothing (along with any trailing intraline-whitespace), and can be used to indent strings for improved legibility. Any other line ending has the same effect as inserting a \n character into the string.
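
The escape rules can be checked by comparing literals that are spelled differently. This sketch uses only the portable escapes.

```scheme
;; A hex escape and the plain character denote the same string.
(string=? "\x41;BC" "ABC")             ; ⇒ #t

;; A mnemonic escape and the matching hex escape are interchangeable.
(string=? "tab\there" "tab\x9;here")   ; ⇒ #t

;; An escape sequence contributes a single character.
(string-length "a\nb")                 ; ⇒ 3

;; A backslash before the line ending removes the line break and the
;; indentation of the continuation line.
(string=? "one \
           line" "one line")           ; ⇒ #t
```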

Examples:

"The word \"recursion\" has many meanings."
"Another example:\ntwo lines of text"
"Here’s text \
containing just one line"
"\x03B1; is named GREEK SMALL LETTER ALPHA."

String templates

The following syntax is a string template (also called a string quasi-literal or “here document”):

&{Hello &[name]!}

Assuming the variable name evaluates to "John" then the example evaluates to "Hello John!".

The Kawa reader converts the above example to:

($string$ "Hello " $<<$ name $>>$ "!")

See the SRFI-109 specification for details.
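
A template is an ordinary expression, so its value can be bound or compared like any other string. A minimal sketch:

```scheme
(define name "John")

;; The template is evaluated where it appears; name is looked up at that point.
(define greeting &{Hello &[name]!})

(string=? greeting "Hello John!")   ; ⇒ #t
```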

extended-string-literal ::= &{ [initial-ignored] string-literal-part^* }
string-literal-part ::=  any character except &, { or }
    | { string-literal-part^* }
    | char-ref
    | entity-ref
    | special-escape
    | enclosed-part

You can use the plain "string" syntax for longer multiline strings, but &{string} has various advantages. The syntax is less error-prone because the start-delimiter is different from the end-delimiter. Also note that nested braces are allowed: a right brace } is only an end-delimiter if it is unbalanced, so you would seldom need to escape it:

&{This has a {braced} section.}
  ⇒ "This has a {braced} section."

The escape character used for special characters is &. This is compatible with XML syntax and the section called “XML literals”.

Special characters

char-ref ::=
    &# digit^+ ;
  | &#x hex-digit^+  ;
entity-ref ::=
    & char-or-entity-name ;
char-or-entity-name ::= tagname

You can use the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:

&{&#27;&#x1B;} ⇒ "\e\e"

You can also use the pre-defined XML entity names:

&{&amp; &lt; &gt; &quot; &apos;} ⇒ "& < > \" '"

In addition, &lbrace; and &rbrace; can be used for the left and right curly braces, though you don't need them for balanced braces:

&{ &rbrace;_&lbrace; / {_} }  ⇒ " }_{ / {_} "

You can use the standard XML entity names. For example:

&{L&aelig;rdals&oslash;yri}
  ⇒ "Lærdalsøyri"

You can also use the standard R7RS character names null, alarm, backspace, tab, newline, return, escape, space, and delete. For example:

&{&escape;&space;}

The syntax &name; is actually syntactic sugar (specifically reader syntax) to the variable reference $entity$:name. Hence you can also define your own entity names:

(define $entity$:crnl "\r\n")
&{&crnl;} ⇒ "\r\n"

Multiline string literals

initial-ignored ::=
    intraline-whitespace^* line-ending intraline-whitespace^* &|
special-escape ::=
    intraline-whitespace^* &|
  | & nested-comment
  | &- intraline-whitespace^* line-ending

A line-ending directly in the text becomes a newline, as in a simple string literal:

(string-capitalize &{one two three
uno dos tres
}) ⇒ "One Two Three\nUno Dos Tres\n"

However, you have extra control over layout. If the string is in a nested expression, it is confusing (and ugly) if the string cannot be indented to match the surrounding context. The indentation marker &| is used to mark the end of insignificant initial whitespace. The &| characters and all the preceding whitespace are removed. In addition, it also suppresses an initial newline. Specifically, this applies when the initial left brace is followed by optional (invisible) intraline-whitespace, then a newline, then optional intraline-whitespace (the indentation), and finally the indentation marker &|, all of which is removed from the output. Otherwise the &| only removes initial intraline-whitespace on the same line (and itself).

(write (string-capitalize &{
     &|one two three
     &|uno dos tres
}) out)
    ⇒ prints "One Two Three\nUno Dos Tres\n"

As a matter of style, all of the indentation lines should line up. It is an error if there are any non-whitespace characters between the previous newline and the indentation marker. It is also an error to write an indentation marker before the first newline in the literal.

The line-continuation marker &- is used to suppress a newline:

&{abc&-
  def} ⇒ "abc  def"

You can write a #|...|#-style comment following a &. This could be useful for annotation, or line numbers:

&{&#|line 1|#one two
  &#|line 2|# three
  &#|line 3|#uno dos tres
} ⇒ "one two\n three\nuno dos tres\n"

Embedded expressions

enclosed-part ::=
    & enclosed-modifier [ expression^* ]
  | & enclosed-modifier ( expression^+ )

An embedded expression has the form &[expression]. It is evaluated, the result is converted to a string (as by display), and that string is added to the result string. (If there are multiple expressions, they are all evaluated and the corresponding strings inserted in the result.)

&{Hello &[(string-capitalize name)]!}

You can leave out the square brackets when the expression is a parenthesized expression:

&{Hello &(string-capitalize name)!}
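
Both forms can appear in the same template, and non-string values are converted as by display. The variable items is introduced only for this sketch.

```scheme
(define items '("tea" "milk"))

;; The bracketed form holds a number; the parenthesized form holds a string.
&{Buying &[(length items)] items, starting with &(car items).}
;; ⇒ "Buying 2 items, starting with tea."
```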

Formatting

enclosed-modifier ::=
  ~ format-specifier-after-tilde^*

Using format allows finer-grained control over the output, but a problem is that the association between format specifiers and data expressions is positional, which is hard-to-read and error-prone. A better solution places the specifier adjacent to the data expression:

&{The response was &~,2f(* 100.0 (/ responses total))%.}

The following escape forms are equivalent to the corresponding forms without the ~fmt-spec, except the expression(s) are formatted using format:

&~fmt-spec[expression^*]

Again using parentheses like this:

&~fmt-spec(expression^+)

is equivalent to:

&~fmt-spec[(expression^+)]

The syntax of format specifications is arcane, but it allows you to do some pretty neat things in a compact space. For example to include "_" between each element of the array arr you can use the ~{...~} format specifiers:

(define arr [5 6 7])
&{&~{&[arr]&~^_&~}} ⇒ "5_6_7"

If no format is specified for an enclosed expression, then that is equivalent to a ~a format specifier, so this is equivalent to:

&{&~{&~a[arr]&~^_&~}} ⇒ "5_6_7"

which is in turn equivalent to:

(format #f "~{~a~^_~}" arr)

The fine print that makes this work: If there are multiple expressions in a &[...] with no format specifier then there is an implicit ~a for each expression. On the other hand, if there is an explicit format specifier, it is not repeated for each enclosed expression: it appears exactly once in the effective format string, whether there are zero, one, or many expressions.
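
The correspondence with format can be checked directly. This sketch compares a template that carries an explicit specifier with the equivalent positional call; the variable ratio is introduced only for this example.

```scheme
(define ratio 0.4567)

;; The specifier ~,2f sits next to the expression it formats ...
(define inline &{Rate: &~,2f[ratio]})

;; ... whereas format keeps specifiers and data in separate positions.
(define positional (format #f "Rate: ~,2f" ratio))

(string=? inline positional)   ; expected to be #t, given the equivalence above
```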

Unicode character classes and conversions

Some of the procedures that operate on characters or strings ignore the difference between upper case and lower case. These procedures have -ci (for “case insensitive”) embedded in their names.

Characters

char-upcase char

char-downcase char

char-titlecase char

char-foldcase char

These procedures take a character argument and return a character result.

If the argument is an upper–case or title–case character, and if there is a single character that is its lower–case form, then char-downcase returns that character.

If the argument is a lower–case or title–case character, and there is a single character that is its upper–case form, then char-upcase returns that character.

If the argument is a lower–case or upper–case character, and there is a single character that is its title–case form, then char-titlecase returns that character.

If the argument is not a title–case character and there is no single character that is its title–case form, then char-titlecase returns the upper–case form of the argument.

Finally, if the character has a case–folded character, then char-foldcase returns that character. Otherwise the character returned is the same as the argument.

For Turkic characters #\x130 and #\x131, char-foldcase behaves as the identity function; otherwise char-foldcase is the same as char-downcase composed with char-upcase.
(char-upcase #\i)               ⇒  #\I
(char-downcase #\i)             ⇒  #\i
(char-titlecase #\i)            ⇒  #\I
(char-foldcase #\i)             ⇒  #\i

(char-upcase #\ß)               ⇒  #\ß
(char-downcase #\ß)             ⇒  #\ß
(char-titlecase #\ß)            ⇒  #\ß
(char-foldcase #\ß)             ⇒  #\ß

(char-upcase #\Σ)               ⇒  #\Σ
(char-downcase #\Σ)             ⇒  #\σ
(char-titlecase #\Σ)            ⇒  #\Σ
(char-foldcase #\Σ)             ⇒  #\σ

(char-upcase #\ς)               ⇒  #\Σ
(char-downcase #\ς)             ⇒  #\ς
(char-titlecase #\ς)            ⇒  #\Σ
(char-foldcase #\ς)             ⇒  #\σ
Note: char-titlecase does not always return a title–case character.

Note: These procedures are consistent with Unicode's locale–independent mappings from scalar values to scalar values for upcase, downcase, titlecase, and case–folding operations. These mappings can be extracted from UnicodeData.txt and CaseFolding.txt from the Unicode Consortium, ignoring Turkic mappings in the latter.

Note that these character–based procedures are an incomplete approximation to case conversion, even ignoring the user's locale. In general, case mappings require the context of a string, both in arguments and in result. The string-upcase, string-downcase, string-titlecase, and string-foldcase procedures perform more general case conversion.
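
The difference between character-level and string-level conversion can be seen with “ß”, which has no single-character upper-case form. This is a minimal sketch that assumes the R7RS string-map procedure is available:

```scheme
;; Character by character: ß has no one-character upper-case form,
;; so it is left unchanged.
(char-upcase #\ß)                    ; ⇒ #\ß
(string-map char-upcase "Straße")    ; ⇒ "STRAßE"

;; String-level conversion may change the length of the result.
(string-upcase "Straße")             ; ⇒ "STRASSE"
(string-length "Straße")             ; ⇒ 6
(string-length (string-upcase "Straße")) ; ⇒ 7
```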

char-ci=? char_1 char_2 char_3 …

char-ci<? char_1 char_2 char_3 …

char-ci>? char_1 char_2 char_3 …

char-ci<=? char_1 char_2 char_3 …

char-ci>=? char_1 char_2 char_3 …

These procedures are similar to char=?, etc., but operate on the case–folded versions of the characters.
(char-ci<? #\z #\Z)             ⇒ #f
(char-ci=? #\z #\Z)             ⇒ #t
(char-ci=? #\ς #\σ)             ⇒ #t
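
Since these procedures compare case-folded characters, a two-argument char-ci=? can be sketched in terms of char-foldcase. The name same-letter? is hypothetical and only used for illustration:

```scheme
;; Hypothetical helper: equality after case folding, which is what
;; char-ci=? does for two arguments.
(define (same-letter? a b)
  (char=? (char-foldcase a) (char-foldcase b)))

(same-letter? #\z #\Z)    ; ⇒ #t
(same-letter? #\ς #\σ)    ; ⇒ #t
(same-letter? #\a #\b)    ; ⇒ #f
```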

char-alphabetic? char

char-numeric? char

char-whitespace? char

char-upper-case? char

char-lower-case? char

char-title-case? char

These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper–case, lower–case, or title–case characters, respectively; otherwise they return #f.

A character is alphabetic if it has the Unicode “Alphabetic” property. A character is numeric if it has the Unicode “Numeric” property. A character is whitespace if it has the Unicode “White_Space” property. A character is upper case if it has the Unicode “Uppercase” property, lower case if it has the “Lowercase” property, and title case if it is in the Lt general category.
(char-alphabetic? #\a)          ⇒  #t
(char-numeric? #\1)             ⇒  #t
(char-whitespace? #\space)      ⇒  #t
(char-whitespace? #\x00A0)      ⇒  #t
(char-upper-case? #\Σ)          ⇒  #t
(char-lower-case? #\σ)          ⇒  #t
(char-lower-case? #\x00AA)      ⇒  #t
(char-title-case? #\I)          ⇒  #f
(char-title-case? #\x01C5)      ⇒  #t
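
Because all of these predicates take a single character, they can be passed to a generic helper. The following sketch counts the characters of a string that satisfy a predicate; count-matching is a hypothetical name, not part of Kawa:

```scheme
;; Hypothetical helper: count the characters in str satisfying pred.
(define (count-matching pred str)
  (let loop ((i 0) (n 0))
    (if (= i (string-length str))
        n
        (loop (+ i 1)
              (if (pred (string-ref str i)) (+ n 1) n)))))

(count-matching char-alphabetic? "abc 123")   ; ⇒ 3
(count-matching char-numeric? "abc 123")      ; ⇒ 3
(count-matching char-whitespace? "abc 123")   ; ⇒ 1
```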

char-general-category char

Return a symbol representing the Unicode general category of char, one of Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Ps, Pe, Pi, Pf, Pd, Pc, Po, Sc, Sm, Sk, So, Zs, Zp, Zl, Cc, Cf, Cs, Co, or Cn.
(char-general-category #\a)         ⇒ Ll
(char-general-category #\space)     ⇒ Zs
(char-general-category #\x10FFFF)   ⇒ Cn

Deprecated in-place case modification

The following functions are deprecated; they really don't and cannot do the right thing, because in some languages the upper-case and lower-case forms of a string can use a different number of characters.

string-upcase! str

Deprecated: Destructively modify str, replacing the letters by their upper-case equivalents.

string-downcase! str

Deprecated: Destructively modify str, replacing the letters by their lower-case equivalents.

string-capitalize! str

Deprecated: Destructively modify str, such that the letters that start a new word are replaced by their title-case equivalents, while non-initial letters are replaced by their lower-case equivalents.
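
A sketch of the non-destructive alternative, using the string-upcase, string-downcase and string-titlecase procedures mentioned earlier: each returns a new string and leaves its argument alone, so the result is free to differ in length from the input.

```scheme
(define s (string-copy "hello world"))

;; Each call returns a fresh string.
(string-upcase s)       ; ⇒ "HELLO WORLD"
(string-titlecase s)    ; ⇒ "Hello World"
(string-downcase "HELLO")   ; ⇒ "hello"

;; The original string is not modified.
s                       ; ⇒ "hello world"
```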

Regular expressions

Kawa provides regular expressions, a convenient mechanism for matching a string against a pattern and optionally replacing the matching parts.

A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.

Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern “abc” matches a string that contains the characters “a”, “b”, “c” in succession.

In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern “a.c”, the characters “a” and “c” do stand for themselves but the metacharacter “.” can match any character (other than newline). Therefore, the pattern “a.c” matches an “a”, followed by any character, followed by a “c”.

If we need to match the character “.” itself, we escape it, i.e., precede it with a backslash “\”. The character sequence “\.” is thus a metasequence, since it doesn’t match itself but rather just “.”. So, to match “a” followed by a literal “.” followed by “c” we use the regexp pattern “a\.c”. To write this as a Scheme string literal, you need to quote the backslash, so you need to write "a\\.c". Kawa also allows the literal syntax #/a\.c/, which avoids the need to double the backslashes.
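
The effect of escaping can be checked with regex-match, which is described later in this section. This sketch assumes the regex functions have been imported with (import (kawa regex)):

```scheme
(import (kawa regex))

;; Unescaped "." matches any character, so "abc" matches.
(regex-match "a.c" "abc")       ; ⇒ ("abc")

;; Escaped "." only matches a literal period.
(regex-match "a\\.c" "abc")     ; ⇒ #f
(regex-match "a\\.c" "a.c")     ; ⇒ ("a.c")

;; The same pattern as a regex literal, with a single backslash.
(regex-match #/a\.c/ "a.c")     ; ⇒ ("a.c")
```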

You can choose between two similar styles of regular expressions. The two differ slightly in terms of which characters act as metacharacters, and what those metacharacters mean:

Functions starting with regex- are implemented using the java.util.regex package. This is likely to be more efficient, and it has better Unicode support, some other minor extra features, and the literal syntax #/a\.c/ mentioned above.
Functions starting with pregexp- are implemented in pure Scheme using Dorai Sitaram's “Portable Regular Expressions for Scheme” library. These are portable to more Scheme implementations, including BRL, and are available on older Java versions.

Java regular expressions

The syntax for regular expressions is that of java.util.regex.Pattern, and is described in the documentation for that class.

regex

A compiled regular expression, implemented as java.util.regex.Pattern.

regex arg

Given a regular expression pattern (as a string), compiles it to a regex object.
(regex "a\\.c")
This compiles into a pattern that matches an “a”, followed by a literal “.”, followed by a “c”.

The Scheme reader recognizes “#/” as the start of a regular expression pattern literal, which ends with the next un-escaped “/”. This has the big advantage that you don't need to double the backslashes:

#/a\.c/

This is equivalent to (regex "a\\.c"), except it is compiled at read-time. If you need a literal “/” in a pattern, just escape it with a backslash: “#/a\/c/” matches an “a”, followed by a “/”, followed by a “c”.

You can add single-letter modifiers following the pattern literal. The following modifiers are allowed:

i

The modifier “i” causes the matching to ignore case. For example the following pattern matches “a” or “A”.

#/a/i

m

Enables “metaline” mode. Normally the metacharacters “^” and “$” match at the start and end of the entire input string. In metaline mode “^” also matches just after a line terminator, and “$” also matches just before one.

Metaline mode (called multiline mode in java.util.regex) can also be enabled by the metasequence “(?m)”.

s

Enables “singleline” (aka “dot-all”) mode. In this mode the metacharacter “.” matches any character, including a line break. This mode can also be enabled by the metasequence “(?s)”.
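
The following sketch exercises each modifier on a two-line text string. It uses regex-match, described below, and assumes (import (kawa regex)) has been done; the test strings are only illustrative.

```scheme
(import (kawa regex))

;; i: ignore case.
(regex-match #/a/i "A")            ; ⇒ ("A")

;; m: without it, ^ only matches at the start of the whole string.
(regex-match #/^b/ "a\nb")         ; ⇒ #f
(regex-match #/^b/m "a\nb")        ; ⇒ ("b")
(regex-match "(?m)^b" "a\nb")      ; ⇒ ("b")

;; s: without it, "." does not match a newline.
(regex-match #/a.b/ "a\nb")        ; ⇒ #f
(regex-match #/a.b/s "a\nb")       ; ⇒ ("a\nb")
```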

The following functions accept a regex either as a pattern string or as a compiled regex pattern. That is, the following are all equivalent:

(regex-match "b\\.c" "ab.cd")
(regex-match #/b\.c/ "ab.cd")
(regex-match (regex "b\\.c") "ab.cd")
(regex-match (java.util.regex.Pattern:compile "b\\.c") "ab.cd")

These all evaluate to the list ("b.c").

The following functions must be imported by doing one of:

(require 'regex) ;; or
(import (kawa regex))

regex-match-positions regex string [start [end]]

The procedure regex‑match‑positions takes a pattern and a text string, and returns a match if the regex matches (some part of) the text string.

It returns #f if the regexp did not match the string, and a list of index pairs if it did match.
(regex-match-positions "brain" "bird") ⇒ #f
(regex-match-positions "needle" "hay needle stack")
  ⇒ ((4 . 10))
In the second example, the integers 4 and 10 identify the substring that was matched. 4 is the starting (inclusive) index and 10 the ending (exclusive) index of the matching substring.
(substring "hay needle stack" 4 10) ⇒ "needle"
In this case the return list contains only one index pair, and that pair represents the entire substring matched by the regexp. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.

regex‑match‑positions takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place.
(regex-match-positions "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
  ⇒ ((31 . 37))
Note that the returned indices are still reckoned relative to the full text string.

regex-match regex string [start [end]]

The procedure regex‑match is called like regex‑match‑positions but instead of returning index pairs it returns the matching substrings:
(regex-match "brain" "bird") ⇒ #f
(regex-match "needle" "hay needle stack")
  ⇒ ("needle")
regex‑match also takes optional third and fourth arguments, with the same meaning as does regex‑match‑positions.
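
When the pattern contains parenthesized subpatterns, the result is expected to hold one entry for the whole match followed by one entry per subpattern. The sketch below assumes that behavior and assumes every subpattern takes part in the match; the pattern and text are only illustrative.

```scheme
(import (kawa regex))

;; Two subpatterns: the digits before and after the hyphen.
(regex-match #/(\d+)-(\d+)/ "tel 555-1234")
  ; ⇒ ("555-1234" "555" "1234")

;; The same match reported as (start . end) index pairs.
(regex-match-positions #/(\d+)-(\d+)/ "tel 555-1234")
  ; ⇒ ((4 . 12) (4 . 7) (8 . 12))
```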

regex-split regex string

Takes two arguments, a regex pattern and a text string, and returns a list of substrings of the text string, where the pattern identifies the delimiter separating the substrings.
(regex-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
  ⇒ ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

(regex-split " " "pea soup")
  ⇒ ("pea" "soup")
If the first argument can match an empty string, then the list of all the single-character substrings is returned, plus we get an empty string at each end.
(regex-split "" "smithereens")
  ⇒ ("" "s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s" "")
(Note: This behavior is different from pregexp-split.)

To identify one-or-more spaces as the delimiter, take care to use the regexp “ +”, not “ *”.
(regex-split " +" "split pea     soup")
  ⇒ ("split" "pea" "soup")
(regex-split " *" "split pea     soup")
  ⇒ ("" "s" "p" "l" "i" "t" "" "p" "e" "a" "" "s" "o" "u" "p" "")
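
A delimiter pattern that always consumes at least one character avoids the empty-match behavior shown above. As a sketch, the following splits on a comma followed by optional whitespace; the input string is only illustrative.

```scheme
(import (kawa regex))

;; The comma is required, so the pattern can never match an
;; empty string; the whitespace after it is optional.
(regex-split #/,\s*/ "a, b,c,   d")
  ; ⇒ ("a" "b" "c" "d")
```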

regex-replace regex string replacement

Replaces the first matched portion of the text string by a replacement string.
(regex-replace "te" "liberte" "ty")
  ⇒ "liberty"
Submatches can be used in the replacement string argument. The replacement string can use “$n” as a backreference to refer back to the nth submatch, i.e., the substring that matched the nth subpattern. “$0” refers to the entire match.
(regex-replace #/_(.+?)_/
               "the _nina_, the _pinta_, and the _santa maria_"
               "*$1*")
  ⇒ "the *nina*, the _pinta_, and the _santa maria_"

regex-replace* regex string replacement

Replaces all matches in the text string by the replacement string:

(regex-replace* "te" "liberte egalite fraternite" "ty")
  ⇒ "liberty egality fratyrnity"
(regex-replace* #/_(.+?)_/
                "the _nina_, the _pinta_, and the _santa maria_"
                "*$1*")
  ⇒ "the *nina*, the *pinta*, and the *santa maria*"
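
Two more backreference sketches, assuming the replacement string follows the java.util.regex conventions described above: “$0” wraps each whole match, and “$1” and “$2” can be used to reorder submatches. The input strings are only illustrative.

```scheme
(import (kawa regex))

;; $0 is the entire match: bracket every vowel.
(regex-replace* "[aeiou]" "regex" "<$0>")
  ; ⇒ "r<e>g<e>x"

;; $1 and $2 refer to the two subpatterns: swap the words.
(regex-replace #/(\w+) (\w+)/ "hello world" "$2 $1")
  ; ⇒ "world hello"
```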

regex-quote pattern

Takes an arbitrary string and returns a pattern string that precisely matches it. In particular, characters in the input string that could serve as regex metacharacters are escaped as needed.
(regex-quote "cons")
  ⇒ "\Qcons\E"
regex‑quote is useful when building a composite regex from a mix of regex strings and verbatim strings.
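
As a sketch of building such a composite regex, the verbatim string "1+1" below contains the metacharacter “+”. Quoting it lets it be combined safely with the regex fragments “^” and “=”. The variable name literal is only illustrative.

```scheme
(import (kawa regex))

(define literal "1+1")

;; Unquoted, "+" is a quantifier, so the pattern does not match.
(regex-match (string-append "^" literal "=") "1+1=2")
  ; ⇒ #f

;; Quoted, the verbatim string matches itself.
(regex-match (string-append "^" (regex-quote literal) "=") "1+1=2")
  ; ⇒ ("1+1=")
```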

Portable Scheme regular expressions

This provides the procedures pregexp, pregexp‑match‑positions, pregexp‑match, pregexp‑split, pregexp‑replace, pregexp‑replace*, and pregexp‑quote.

Before using them, you must require them:

(require 'pregexp)

These procedures have the same interface as the corresponding regex- versions, but take slightly different pattern syntax. The replace commands use “\” instead of “$” to indicate substitutions. Also, pregexp‑split behaves differently from regex‑split if the pattern can match an empty string.
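
The two differences just mentioned can be sketched as follows, reusing the earlier regex- examples. The expected results follow the pregexp library's own documentation.

```scheme
(require 'pregexp)

;; Backreferences in the replacement use "\" (doubled inside a
;; Scheme string literal) rather than "$".
(pregexp-replace* "_(.+?)_"
                  "the _nina_, the _pinta_, and the _santa maria_"
                  "*\\1*")
  ; ⇒ "the *nina*, the *pinta*, and the *santa maria*"

;; An empty pattern splits into single characters, without the
;; empty strings that regex-split adds at each end.
(pregexp-split "" "smithereens")
  ; ⇒ ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")
```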

See the documentation of Dorai Sitaram's pregexp library for details.