This tool generates a lexical analyzer from a list of pattern descriptions given as regular expressions. Its input
file is very similar in structure and syntax to the input of the standard lex tool. But unlike lex, which generates a
full analyzer with inlined pattern-matching handlers, lexegen generates only the tables and a C-compliant
pattern-matching function as files, which can be included into the rest of the analyzer's implementation.
This README briefly describes the tool and gives guidelines on how to use it; note that it may not be fully
up-to-date. For more detailed information see the wiki pages.
Assume that we have already built this tool as the lexegen executable, and we want to create a simple lexical
analyzer which extracts the following tokens from an input buffer: C identifiers, unsigned integers, floating-point
numbers, and C strings. We also want to skip C comments and whitespace.
Here is the source input file test.lex for this analyzer:
# This is a simple lexical analyzer.
# This section describes named regular expression definitions, which can be
# expanded in further expressions via `{name}` token.
# Each line has the form: <definition-name> <regex>.
# Note that only the end-of-line character terminates a regular expression, so
# a comment can't be placed on the same line as the expression.
# Start conditions can also be defined in this section using the form: %start <sc-name>.
# By default only one `initial` start condition is implicitly defined.
# a digit
dig [0-9]
# a letter
letter [a-zA-Z]
# unsigned integer
int {dig}+
# a C-identifier
id ({letter}|_)({letter}|{dig}|_)*
# a floating-point number
real (({dig}+(\.{dig}*)?)|(\.{dig}+))((e|E)(\+|\-)?{dig}+)?
# white-space character
ws [ \t\r\n]
# C-style comment
comment \/\* ( [^*\\] | \\(.|\n) | \*+([^*/\\]|\\(.|\n)) )* \*+\/
# a C-string
string \" ( [^"\\] | \\(.|\n) )* \"
%% # section separator
# This section describes patterns as regular expressions.
# Each line has the form: <pattern-name> <regex>,
# or the form: <pattern-name> <s1, s2, ...> <regex> with start condition list.
# If start condition list is not specified, the pattern participates in all
# defined start conditions.
comment {comment}
string {string}
id {id}
int {int}
real {real}
ws {ws}+
other .
%% # mandatory end of section

Then, in order to generate our lexical analyzer, let's issue the following:
$ ./lexegen test.lex
test.lex: info: building analyzer...
test.lex: info: - pattern count: 7
test.lex: info: - S-state count: 1
test.lex: info: - position count: 44
test.lex: info: - meta-symbol count: 13
test.lex: info: - state count: 24
test.lex: info: - transition table size: 1248 bytes
test.lex: info: done
test.lex: info: optimizing states...
test.lex: info: - state group count: 19
test.lex: info: - dead group count: 0
test.lex: info: - new state count: 19
test.lex: info: - transition table size: 988 bytes
test.lex: info: done
test.lex: info: compressing tables...
test.lex: info: - total compressed transition table size: 656 bytes
test.lex: info: done

As a result, two files with the default names lex_defs.h and lex_analyzer.inl are generated. To specify the names
explicitly, issue the following:

./lexegen test.lex -o <new-analyzer-file-name> --header-file=<new-defs-file-name>

The file lex_defs.h contains numerical identifiers for patterns and start conditions (or starting analyzer states).
Only one start condition, sc_initial, is defined for our example.
The file lex_analyzer.inl contains the necessary tables and the implementation of the lex() function, defined as
static. This function has the following prototype:

static int lex(const char* first, const char* last, int** p_sptr, size_t* p_llen, int flags);

where:

- first - pointer to the first character of the input buffer
- last - pointer to the character after the last character of the input buffer
- p_sptr - pointer to the current user-provided DFA stack pointer
- p_llen - pointer to the current matched lexeme length
- flags - can be a bitwise `or` of the following flags:
  - flag_has_more - do not treat the end of the input buffer as the end of the input sequence
  - flag_at_beg_of_line - tell the analyzer that it is at the beginning of a new line (activates patterns starting with ^)

returns: the matched pattern identifier
The analyzer always tries to match one of the patterns against the longest possible chunk of text. If more than one pattern fits this chunk, pattern priority is used: the first pattern has the highest priority, and the last one has the lowest. For example, for the input 3.14 both int (matching 3) and real (matching 3.14) fit; real wins because its match is longer.
The starting state must be on the top of the user-provided DFA stack before calling lex(), and the current stack
pointer *p_sptr must point to the position after the last state (the first free stack cell).

If a pattern is matched, the stack pointer *p_sptr is the same after the function returns as it was before the call.

If flag_has_more is not specified, the stack pointer *p_sptr is always restored in this way. If no user-provided
pattern matches, the function skips one character and returns predef_pat_default, so it always matches at least one
character from the input buffer. If the input buffer is empty, it returns err_end_of_input (which is negative).
If flag_has_more is specified, then upon reaching the end of the input buffer the analyzer leaves the stack pointer
*p_sptr as it is and returns err_end_of_input. This gives a chance to add more characters to the input sequence and
call the lex() function again to continue the analysis. The already-analyzed part of the input is no longer needed:
all the necessary information is in the state stack. In theory, the old input buffer can be freed, but in practice it
will likely be needed later to assemble the full lexeme.
The user-provided DFA stack must have at least as many free cells as the length of the longest possible lexeme, or
last - first cells if lexemes of arbitrary length must be handled. An alternative approach is to trim the input
buffer whenever it is longer than the free range in the DFA stack and to use the flag_has_more flag. The code will
look roughly like this:
auto state_stack = std::make_unique<int[]>(kInitialStackSize);
int* slast = state_stack.get() + kInitialStackSize;
int* sptr = state_stack.get();
*sptr++ = lex_detail::sc_initial;
...
int pat = 0;
std::size_t llen = 0;
...
while (true) {
const char* trimmed_last = last;
if (slast - sptr < last - first) { trimmed_last = first + static_cast<std::ptrdiff_t>(slast - sptr); }
pat = lex_detail::lex(first, trimmed_last, &sptr, &llen,
trimmed_last != last ? lex_detail::flag_has_more : 0);
if (pat >= lex_detail::predef_pat_default) { break; } // full lexeme is obtained
if (trimmed_last != last) {
// enlarge state stack, update `slast` & `sptr`, and continue analysis
<... enlarge state stack ...>
first = trimmed_last;
} else {
<... end of sequence ...>
}
}
...

Note that it is convenient to use the state stack as a start-condition stack as well.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <memory>
#include <string_view>
namespace lex_detail {
#include "lex_defs.h"
#include "lex_analyzer.inl"
} // namespace lex_detail
int main(int argc, char* argv[]) {
constexpr std::size_t kInitialStackSize = 64;
if (argc < 2) { return -1; }
const char* first = argv[1]; // the first char pointer
const char* last = argv[1] + std::strlen(argv[1]); // after the last char pointer
auto state_stack = std::make_unique<int[]>(kInitialStackSize);
int* slast = state_stack.get() + kInitialStackSize;
int* sptr = state_stack.get();
*sptr++ = lex_detail::sc_initial;
while (true) {
int pat = 0;
std::size_t llen = 0;
const char* first0 = first;
while (true) {
const char* trimmed_last = last;
if (slast - sptr < last - first) { trimmed_last = first + static_cast<std::ptrdiff_t>(slast - sptr); }
pat = lex_detail::lex(first, trimmed_last, &sptr, &llen,
trimmed_last != last ? lex_detail::flag_has_more : 0);
if (pat >= lex_detail::predef_pat_default) { break; } // full lexeme is obtained
if (trimmed_last == last) { return 0; } // end of sequence
// enlarge state stack, update `slast` & `sptr`, and continue analysis
const std::size_t old_stack_size = static_cast<std::size_t>(sptr - state_stack.get());
const std::size_t new_stack_size = 2 * old_stack_size;
auto new_state_stack = std::make_unique<int[]>(new_stack_size);
std::memcpy(new_state_stack.get(), state_stack.get(), old_stack_size * sizeof(*sptr));
slast = new_state_stack.get() + new_stack_size;
sptr = new_state_stack.get() + old_stack_size;
state_stack = std::move(new_state_stack);
first = trimmed_last;
}
std::string_view lexeme(first0, llen);
switch (pat) {
case lex_detail::pat_comment: std::cout << "comment: " << lexeme << std::endl; break;
case lex_detail::pat_string: std::cout << "string: " << lexeme << std::endl; break;
case lex_detail::pat_id: std::cout << "id: " << lexeme << std::endl; break;
case lex_detail::pat_int: std::cout << "int: " << lexeme << std::endl; break;
case lex_detail::pat_real: std::cout << "real: " << lexeme << std::endl; break;
case lex_detail::pat_other: std::cout << "other: " << lexeme << std::endl; break;
default: break;
}
first = first0 + llen;
}
}

Build and run result:
$ ./lexegen test.lex
$ g++ example.cpp
$ ./a.out "identifier 100; /*bla-bla-bla*/ 3.1415; \"hello, world\";"
id: identifier
int: 100
other: ;
comment: /*bla-bla-bla*/
real: 3.1415
other: ;
string: "hello, world"
other: ;

These rules are used to compose regular expressions in definitions and patterns:
- `x` - if this character is not a ' ', FF, CR, HT, or VT, matches the character 'x'
- `.` - any character (byte) except newline (NL)
- `[xyz]` - a "character class"; in this case, matches 'x', 'y', or 'z'
- `[abj-oZ]` - a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
- `[^A-Z]` - a "negated character class", i.e. any character but those in the class; in this case, any character EXCEPT an uppercase letter
- `[^A-Z\n]` - any character EXCEPT an uppercase letter or a newline
- `{name}` - the expansion of the "name" definition (see above)
- `"[xyz]\"foo"` - the literal string: `[xyz]"foo`
- `\X` - if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X; otherwise, a literal 'X' (used to escape operators such as '*')
- `\123` - the character with octal value 123
- `\x2a` - the character with hexadecimal value 2a
- `r*` - zero or more r's, where r is any regular expression
- `r+` - one or more r's
- `r?` - zero or one r (that is, "an optional r")
- `r{2,5}` - anywhere from two to five r's
- `r{2,}` - two or more r's
- `r{,2}` - no more than two r's
- `r{4}` - exactly 4 r's
- `(r)` - matches an r; parentheses are used to override precedence
- `rs` - the regular expression r followed by the regular expression s; called "concatenation"
- `r|s` - either an r or an s
- `r/s` - an r, but only if it is followed by an s. The text matched by s is included when determining whether this rule is the "longest match", but is then returned to the input, so the returned lexeme is only the text matched by r. This type of pattern is called "trailing context". It must be the top-level pattern operator; otherwise the behavior is undefined.
- `^r` - an r, but only at the beginning of a line (i.e. when the flag_at_beg_of_line flag is specified). It must be the first pattern operator; it is ignored otherwise.
- `!^r` - an r, but only when not at the beginning of a line (i.e. when the flag_at_beg_of_line flag is not specified). It must be the first pattern operator; it is ignored otherwise.
- `r$` - an r, but only at the end of a line (i.e. just before a newline). Equivalent to `r/\n`.
Note that the end of input is not treated as a newline character, so it is not the end of a line.
In addition to characters and ranges of characters, character classes can also contain character class expressions.
These are expressions enclosed inside [: and :] delimiters, which themselves must appear between the [ and ] of
the character class. Other elements may occur inside the character class, too. The valid expressions are:
- [:alnum:] - alphanumeric characters [A-Za-z0-9]
- [:alpha:] - alphabetic characters [A-Za-z]
- [:blank:] - space and tab [ \t]
- [:cntrl:] - control characters [\x01-\x1F\x7F]
- [:digit:] - digits [0-9]
- [:graph:] - visible characters [\x21-\x7E]
- [:lower:] - lowercase letters [a-z]
- [:print:] - visible characters and the space character [\x20-\x7E]
- [:punct:] - punctuation characters [\]\[!"#$%&'()*+,./:;<=>?@\\^_`{|}~\-]
- [:space:] - whitespace characters [ \t\r\n\v\f]
- [:upper:] - uppercase letters [A-Z]
- [:xdigit:] - hexadecimal digits [A-Fa-f0-9]
For example, the following character classes are all equivalent:
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:][0-9]]
[a-zA-Z0-9]
Note that ' ', FF, CR, HT, and VT characters (bytes) are skipped while parsing regular expressions; use \x20, \f,
\r, \t, or \v instead. Also, the zero character '\0' (byte) is always treated as non-matchable.
$ ./lexegen --help
OVERVIEW: A tool for regular-expression based lexical analyzer generation
USAGE: ./lexegen file [-o <file>] [--header-file=<file>] [--no-case] [--compress <n>]
[--use-int8-if-possible] [-O <n>] [-h] [-V]
OPTIONS:
-o, --outfile=<file> Place the output analyzer into <file>.
--header-file=<file> Place the output definitions into <file>.
--no-case Build case insensitive analyzer.
--compress <n> Set compression level to <n>:
0 - do not compress analyzer table, do not use `meta` table;
1 - do not compress analyzer table;
2 - Default compression.
--use-int8-if-possible Use `int8_t` instead of `int` for states if state count is < 128.
-O <n> Set optimization level to <n>:
0 - Do not optimize analyzer states;
1 - Default analyzer optimization.
-h, --help Display this information.
-V, --version Display version.

Perform these steps to build the project:
- Clone the lexegen repository and enter it:

  $ git clone https://github.com/gbuzykin/lexegen.git
  $ cd lexegen

- Initialize and update the uxs submodule:

  $ git submodule update --init

- Create the compilation script using the cmake tool (the default C++ compiler will be used):

  $ cmake --preset default

- Build lexegen:

  $ cmake --build build --config Release

  To run several parallel processes (e.g. 8) for building, use the -j key:

  $ cmake --build build --config Release -j 8

- Install lexegen:

  $ cmake --install build --config Release --prefix <install-dir>