Definition
Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that divides it into lexemes and converts them into a sequence of tokens, which are used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner, or tokenizer.
tokens / lexemes
Lexemes are the relevant syntactic units in the lexer's context; for example, for a lexer targeting a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, function.
A token is a structure that categorizes a lexeme: it contains an abstract name that represents the lexeme's group and, when the group is not unique, a possible value. Some example tokens (each token is represented in the format < name, value >, and "#" marks the start of a comment; a sketch of this representation in code follows the list):
< digit, 1 > # any numeric value
< literal, "hello" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters that represents a variable
< function > # the letters f,u,n,c,t,i,o,n
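To make this concrete, here is a minimal sketch (my own illustration, not a standard representation) of how these tokens could be modeled in JavaScript:

// a hypothetical token representation: a name that identifies
// the group, plus an optional value when the group is not unique
var exampleTokens = [
    {name: 'digit', value: '1'},
    {name: 'literal', value: '"hello"'},
    {name: 'for'},                       // no value: the group is unique
    {name: 'comparison', value: '=='},
    {name: 'id', value: 'variableName'},
    {name: 'function'}
];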
difference from parsing
Lexers and parsers are closely linked, yet they are distinct concepts. The lexer specializes in extracting the relevant portions of the text and transforming them into "meaningful" structures, in this case the tokens. The parser has the job of analyzing the syntactic structure of a text, for example to say whether, in a given programming language, the expression "hello" 1 == "anotherliteral" is syntactically valid or invalid. The connection between the two is that the structures produced by the lexer are what the parser uses as its source to perform the parsing; that is, the parser works with tokens (see the sketch after the list below). This is not required; nothing prevents you from building a parser that parses raw text directly, but separating the two tasks has some advantages:
Design simplification. A parser that also performs the work of a lexer is significantly more complex.
Optimization. By separating the two tasks (lexing and parsing), you have more freedom to apply optimization techniques specific to each task.
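As an illustration of this relationship, here is a minimal sketch (the grammar and names are invented for this example) of a parser that works over tokens rather than raw text, rejecting the invalid expression mentioned above:

// hypothetical token stream a lexer could produce for: "hello" 1 == "anotherliteral"
var invalid = [
    {name: 'literal', value: '"hello"'},
    {name: 'digit', value: '1'},
    {name: 'comparison', value: '=='},
    {name: 'literal', value: '"anotherliteral"'}
];

// a toy parser: a comparison is valid only if the stream is exactly
// < literal > < comparison > < literal >
function isValidComparison(tokens) {
    return tokens.length === 3 &&
        tokens[0].name === 'literal' &&
        tokens[1].name === 'comparison' &&
        tokens[2].name === 'literal';
}

console.log(isValidComparison(invalid)); // false: syntactically invalid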
a lexer in practice
The theory is essential, yet nothing beats real code to aid understanding. Here I show a simple JavaScript lexer that handles a subset of CSS; more specifically, it handles this example:
h1 {
    color: red;
    font-size: 30px;
}

body {
    background-color: yellow;
}

div {
    margin: 10px;
    padding: 10px;
}
The code can be executed and will show the list of generated tokens after processing our target CSS:
// our CSS
var css = "h1 { \n\
    color: red; \n\
    font-size: 30px; \n\
} \n\
\n\
body { \n\
    background-color: yellow; \n\
} \n\
\n\
div { \n\
    margin: 10px; \n\
    padding: 10px; \n\
} \n\
";
/**
 * List that defines our tokens and the respective regular expressions that find them in the text.
 */
var TOKENS = [
    {name: 'EMPTY', regex: /^(\s+)/ },
    {name: 'RULE_SET_START', regex: /^({)/ },
    {name: 'RULE_SET_END', regex: /^(})/ },
    {name: 'RULE_DEFINITION', regex: /^(:)/ },
    {name: 'RULE_END', regex: /^(;)/ },
    {name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
    {name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
    {name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
];
var text = css;
var outputTokenList = [];
while (text !== '') { // while there is text left to consume
    var hasMatch = false;
    /**
     * We iterate over the whole TOKENS list until we find one whose pattern matches the beginning of our text.
     * When a match occurs we add the lexeme, with its respective token, to the lexer's output token list.
     * If no pattern matches the text, an exception is thrown printing the line that contains the error.
     */
    for (var i = 0; i < TOKENS.length; i++) {
        var obj = TOKENS[i];
        var tokenName = obj.name;
        var tokenRegex = obj.regex;
        var matched = text.match(tokenRegex);
        if (!matched) {
            continue;
        }
        hasMatch = true;
        var lexeme = matched[1];
        // we remove the matched lexeme from the text
        // so that another one can be considered
        // in the next iteration
        text = text.substring(lexeme.length);
        if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
            // we add both the token name and the lexeme
            outputTokenList.push([tokenName, lexeme]);
        }
        else if (tokenName in {'EMPTY': 1}) {
            // discard; we don't care about whitespace and line breaks.
        }
        else {
            // in these cases the relationship between the token name
            // and the lexeme is 1<->1, so we don't need to store the lexeme.
            outputTokenList.push([tokenName]);
        }
        break;
    }
    if (!hasMatch) {
        throw 'Invalid pattern at:\n' + (text.substring(0, text.indexOf('\n')));
    }
}
outputTokenList.forEach(function(token) {
    document.write('< ' + token[0]);
    if (token.length > 1) {
        document.write(' , ' + token[1]);
    }
    document.write(' ><br>');
});
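For reference, when run in a browser the tokens written for the first rule set (h1) should look like this; the remaining rule sets follow the same pattern:

< SELECTOR , h1 >
< RULE_SET_START >
< RULE , color >
< RULE_DEFINITION >
< RULE_VALUE , red >
< RULE_END >
< RULE , font-size >
< RULE_DEFINITION >
< RULE_VALUE , 30px >
< RULE_END >
< RULE_SET_END >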
I am not going to go into detail about lexer implementations, because that is an extensive subject not directly related to the question. Note that this is just an illustration of how a lexer can operate, reading a target text and generating tokens as output; the code is neither efficient nor robust.