Wisteria - Wisl
Wisl is Wisteria’s frontend for generating a lexer. It consists of three parts:
- Token definitions
- Regular definitions
- Token productions
In a Wisl file, the parts are separated by ---, giving the following file structure:
# Token Definitions
---
# Regular Definitions
---
# Token Productions
To explain each section, I will use a language made up of parentheses and numbers as a running example.
Token Definitions
This is where the tokens that will be produced are defined along with their respective type. So, for our language of parentheses and numbers we would have the following:
LPAREN, RPAREN,
NUMBER(i32)
Tokens can either have no type, such as LPAREN and RPAREN, or they can be given any Rust type. The name of a token must be entirely in uppercase letters. Note that tokens can be declared on separate lines, as shown above, but there must always be a separating comma.
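Since token types are arbitrary Rust types, one way to picture these definitions is as variants of a Rust enum. This is only an illustrative sketch of the idea, not the code Wisteria actually generates:

```rust
// Illustrative sketch only: how the token definitions above could be
// pictured as a Rust enum. This is NOT Wisteria's generated code.
#[derive(Debug, PartialEq)]
enum Token {
    LParen,      // LPAREN: carries no value
    RParen,      // RPAREN: carries no value
    Number(i32), // NUMBER(i32): carries an i32
}

fn main() {
    let tok = Token::Number(42);
    assert_eq!(tok, Token::Number(42));
    println!("{:?}", tok);
}
```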
Regular Definitions
Here, we define regular expressions that we may use later when describing the token productions. For our example, we will want to ignore whitespace while also matching any numbers we encounter. This can be done as follows:
delim : [\ \t\n]
whitespace : {delim}+
digit : [0-9]
number : {digit}+
Each line defines a different regular expression. Regular expressions’ names must be entirely in lowercase letters. The syntax of the regular expressions should be familiar but note that curly braces are used to reference previously defined regular expressions.
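The curly-brace references can be understood as plain textual substitution: each {name} is replaced by the pattern it was previously bound to, so {digit}+ denotes [0-9]+. A minimal from-scratch sketch of that substitution idea (not Wisteria's actual implementation):

```rust
use std::collections::HashMap;

// Expand {name} references in a pattern by substituting previously
// defined patterns. A from-scratch illustration of the idea only.
fn expand(pattern: &str, defs: &HashMap<&str, String>) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find('{') {
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        if let Some(end) = after.find('}') {
            let name = &after[..end];
            // Unknown names are replaced by the empty string in this sketch.
            out.push_str(defs.get(name).map(String::as_str).unwrap_or(""));
            rest = &after[end + 1..];
        } else {
            // No closing brace: keep the remainder verbatim.
            out.push_str(&rest[start..]);
            return out;
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let mut defs: HashMap<&str, String> = HashMap::new();
    defs.insert("digit", "[0-9]".to_string());
    // {digit}+ expands to [0-9]+
    let number = expand("{digit}+", &defs);
    assert_eq!(number, "[0-9]+");
    defs.insert("number", number);
    println!("number : {}", defs["number"]);
}
```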
Token Productions
Lastly, we define the regular expressions that produce the tokens that we want:
{whitespace} => _
( => LPAREN
) => RPAREN
{number} => NUMBER(_lex.parse::<i32>().unwrap())
To the right of the =>, we specify the token we want to produce, with _ specifying that we do not wish to produce anything. For tokens that have an internal value, such as NUMBER, we must specify how to convert the content of the regular expression into the appropriate type. Thus, inside the token’s parentheses, using Rust code, we convert the content of the regular expression, bound to the variable _lex, into an i32. Note that _lex has type String.
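The Rust expression inside a token’s parentheses can be thought of as the body of an action function that receives _lex: String and returns the token’s value. A sketch of that correspondence, where the Token enum is invented here purely for illustration:

```rust
// The Token enum here is invented for illustration; it is not part of Wisl.
#[derive(Debug, PartialEq)]
enum Token {
    Number(i32), // stands in for NUMBER(i32)
}

// The production `{number} => NUMBER(_lex.parse::<i32>().unwrap())`
// corresponds roughly to an action function like this:
fn number_action(_lex: String) -> Token {
    Token::Number(_lex.parse::<i32>().unwrap())
}

fn main() {
    assert_eq!(number_action("42".to_string()), Token::Number(42));
}
```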
Further Examples
Say we now want to match hexadecimal numbers as well. We can add the following regular definition and token production:
hex : [0-9a-fA-F]+
---
(0x|0X){hex} => NUMBER(i32::from_str_radix(&_lex[2..], 16).unwrap())
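The action here strips the 0x or 0X prefix with &_lex[2..] and parses the rest in base 16. Its behavior can be checked in plain Rust:

```rust
fn main() {
    // What the NUMBER action computes for the lexeme "0x1A":
    // drop the two-character prefix, then parse the rest in base 16.
    let _lex = String::from("0x1A");
    let value = i32::from_str_radix(&_lex[2..], 16).unwrap();
    assert_eq!(value, 26); // 0x1A == 26
    println!("{}", value);
}
```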
We can also try to match string literals like "Hello World!":
STRING(String)
---
escape : \\b|\\f|\\n|\\r|\\t|\\v|\\'|\\"|\\\\
stringcontent : ([^"\\]|{escape})+
---
"{stringcontent}" => STRING(_lex.get(1.._lex.len()-1).unwrap().to_string())
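The lexeme bound to _lex includes the surrounding quotes, so the action uses get(1.._lex.len()-1) to drop them. In isolation:

```rust
fn main() {
    // The lexeme includes the quotes; the action strips the first
    // and last character to recover the string's content.
    let _lex = String::from("\"Hello World!\"");
    let inner = _lex.get(1.._lex.len() - 1).unwrap().to_string();
    assert_eq!(inner, "Hello World!");
}
```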
The full file with everything just discussed would look something like:
LPAREN, RPAREN,
NUMBER(i32), STRING(String)
---
delim : [\ \t\n]
whitespace : {delim}+
digit : [0-9]
number : {digit}+
hex : [0-9a-fA-F]+
escape : \\b|\\f|\\n|\\r|\\t|\\v|\\'|\\"|\\\\
stringcontent : ([^"\\]|{escape})+
---
{whitespace} => _
( => LPAREN
) => RPAREN
{number} => NUMBER(_lex.parse::<i32>().unwrap())
(0x|0X){hex} => NUMBER(i32::from_str_radix(&_lex[2..], 16).unwrap())
"{stringcontent}" => STRING(_lex.get(1.._lex.len()-1).unwrap().to_string())
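To make the intended behavior concrete, here is a small hand-written lexer for just the parenthesis, whitespace, and decimal-number productions above. It is a from-scratch sketch of what the generated lexer should do, not Wisteria's actual output, and it omits the hex and string productions for brevity:

```rust
// A hand-written sketch of the behavior the parenthesis and number
// productions describe. NOT Wisteria's generated code; the hex and
// string productions are omitted for brevity.
#[derive(Debug, PartialEq)]
enum Token {
    LParen,
    RParen,
    Number(i32),
}

fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // {whitespace} => _  : matched, but no token produced
            ' ' | '\t' | '\n' => {
                chars.next();
            }
            '(' => {
                chars.next();
                tokens.push(Token::LParen);
            }
            ')' => {
                chars.next();
                tokens.push(Token::RParen);
            }
            '0'..='9' => {
                // {number} => NUMBER(_lex.parse::<i32>().unwrap())
                let mut lexeme = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() {
                        lexeme.push(d);
                        chars.next();
                    } else {
                        break;
                    }
                }
                tokens.push(Token::Number(lexeme.parse::<i32>().unwrap()));
            }
            // Anything else is skipped in this sketch; a real lexer
            // would report an error here.
            _ => {
                chars.next();
            }
        }
    }
    tokens
}

fn main() {
    let tokens = lex("(12 34)");
    assert_eq!(
        tokens,
        vec![Token::LParen, Token::Number(12), Token::Number(34), Token::RParen]
    );
    println!("{:?}", tokens);
}
```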