What is lexical analysis?

36

I was taking a look at the source code of a well-known PHP library called Twig (it's a template engine with its own syntax), and I came across classes, interfaces and methods like Lexical, Lex and LexInterface.

I searched for the term and realized it referred to lexical analysis.

Although I understood some things, I was confused by others.

For example, I am used to seeing the terms Parser or Parsing when it comes to transforming one kind of data into another.

  • What would be the lexical analysis?
  • Are lexical analysis and parsing/parser the same thing, or are they really different things?

Sorry if the question is confusing, but I think the community will help me with a good and enlightening answer.

    
asked by anonymous 16.02.2016 / 18:36

6 answers

25

Definition

Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that divides it into lexemes and converts them into a sequence of tokens, which are then used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner or tokenizer.

tokens / lexemes

Lexemes are the syntactic units relevant in the lexer's context; for example, for a lexer aimed at a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, function.

A token is a structure that categorizes a lexeme: it contains an abstract name representing the lexeme's group and, possibly, a value, when the group does not consist of a single lexeme. Examples of some tokens (each token is represented in the format < name, value >, and "#" marks the start of a comment):

< digit, 1 > # any numeric value
< literal, "olá" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters representing a variable
< function > # the letters f,u,n,c,t,i,o,n
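
Purely as an illustration (this structure is hypothetical and not taken from any particular library), such tokens could be represented in JavaScript as small objects holding the group name and, when needed, the value:

// A minimal, hypothetical token representation: groups with many possible
// lexemes (digit, literal, comparison, id) carry a value; groups with a
// single lexeme (for, function) only need the name.
function token(name, value) {
    return value === undefined ? { name: name } : { name: name, value: value };
}

var exampleTokens = [
    token('digit', 1),
    token('literal', 'olá'),
    token('for'),
    token('comparison', '=='),
    token('id', 'variableName'),
    token('function')
];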

difference between lexing and parsing

Lexers and parsers are closely linked yet distinct concepts. The lexer specializes in extracting the relevant portions of the text and transforming them into "meaningful" structures, in this case the tokens. The parser has the job of analyzing the syntactic structure of a text, for example saying whether, in a given programming language, the expression "olá" 1 == "outroliteral" is syntactically valid or invalid. The connection between the two is that the structures produced by the lexer are what the parser uses as its source, that is, the parser works with tokens. This is not mandatory; nothing prevents you from building a parser that works directly over the raw text, but separating the two tasks has some advantages (a short sketch after the list below illustrates this division of labour):

  • Design simplification. A parser that also performs the work of a lexer is significantly more complex.
  • Optimization. By separating the two tasks (lexing and parsing) you have more freedom to apply optimization techniques specific to each one.
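
As a rough sketch of that division of labour (hypothetical code, reusing the token objects sketched earlier), a tiny parser could validate a rule such as comparison -> (digit | literal | id) == (digit | literal | id) while knowing nothing about how the characters were grouped:

// Hypothetical parser working only over tokens, never over raw text.
// It accepts exactly the rule: comparison -> operand '==' operand
function isValidComparison(tokens) {
    var operand = { digit: true, literal: true, id: true };
    return tokens.length === 3 &&
           operand[tokens[0].name] === true &&
           tokens[1].name === 'comparison' &&
           operand[tokens[2].name] === true;
}

// "olá" == "outroliteral"  -> syntactically valid
isValidComparison([
    { name: 'literal', value: 'olá' },
    { name: 'comparison', value: '==' },
    { name: 'literal', value: 'outroliteral' }
]); // true

// "olá" 1 == "outroliteral" -> four tokens, does not fit the rule -> invalid
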
a lexer in practice

    The theory is essential, yet nothing beats real code to aid understanding. Here I show a small JavaScript lexer that handles a subset of CSS; more specifically, it deals with this example:

    h1 {
        color: red;
        font-size: 30px;
    }
    
    body {
        background-color: yellow;
    }
    
    div {
        margin: 10px;
        padding: 10px;
    }
    

    The code can be executed and will show the list of generated tokens after processing our target CSS:

    // our CSS input
    var css = "h1 { \n\
    	color: red; \n\
    	font-size: 30px; \n\
    } \n\
     \n\
    body { \n\
    	background-color: yellow; \n\
    } \n\
     \n\
    div { \n\
    	margin: 10px; \n\
    	padding: 10px; \n\
    } \n\
    ";
    
    /**
    * List that defines our tokens and the respective regular expressions that find them in the text.
    */
    var TOKENS = [
    	{name: 'EMPTY', regex: /^(\s+)/ },
    	{name: 'RULE_SET_START', regex: /^({)/ },
    	{name: 'RULE_SET_END', regex: /^(})/ },
    	{name: 'RULE_DEFINITION', regex: /^(:)/ },
    	{name: 'RULE_END', regex: /^(;)/ },
    	{name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
    	{name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
    	{name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
    ];
    
    
    var text = css;
    var outputTokenList = [];
    while(text !== '') { // while there is text left to consume
    
    	var hasMatch = false;
    	
    	/**
    	 * We iterate over the whole TOKENS list until we find one whose pattern matches the beginning of our text.
    	 * When a "match" occurs, we add the lexeme with its respective token to the lexer's output token list.
    	 * If no pattern matches the text, an exception is thrown printing the line that contains the error.
    	 */
    	for (var i=0; i<TOKENS.length; i++) {
    		var obj = TOKENS[i];
    		var tokenName = obj.name;
    		var tokenRegex = obj.regex;
    		
    		var matched = text.match(tokenRegex);
    		if (!matched) {
    			continue;
    		}
    		
    		hasMatch = true;
    		var lexeme = matched[1];
    		
    		// remove the matched lexeme from the text
    		// so that another one can be considered
    		// in the next iteration
    		text = text.substring(lexeme.length);
    
    		if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
    			// we store both the token name and the lexeme
    			outputTokenList.push([tokenName, lexeme]);
    		}
    		else if (tokenName in {'EMPTY': 1}) {
    			// discard; we don't care about spaces and line breaks.
    		}
    		else {
    			// in these cases the relationship between the token name
    			// and the lexeme is 1<->1, so we don't need to store the lexeme.
    			outputTokenList.push([tokenName]);
    		}
    		
    		break;
    	}
    
    	if (!hasMatch) {
    		// no token pattern matched the remaining text: report the offending line
    		throw 'Invalid pattern at:\n' + (text.substring(0, text.indexOf('\n')));
    	}
    }
    
    outputTokenList.forEach(function(token) {
    	document.write('< ' + token[0]);
        if (token.length > 1) {
    	    document.write(' , ' + token[1]);
        }
    	document.write(' ><br>');
    });
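
    For reference, for the first rule set (h1) the script above should print roughly this sequence of tokens:

    < SELECTOR , h1 >
    < RULE_SET_START >
    < RULE , color >
    < RULE_DEFINITION >
    < RULE_VALUE , red >
    < RULE_END >
    < RULE , font-size >
    < RULE_DEFINITION >
    < RULE_VALUE , 30px >
    < RULE_END >
    < RULE_SET_END >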

    I won't go into detailed explanations of lexer implementations, because that is an extensive subject not directly related to the question. Note that this is just an illustration of how a lexer may operate: it reads the target text and generates tokens as output; the code is neither efficient nor robust.

        
    20.02.2016 / 03:01
    9
      

    What would be the Lexical Analysis?

    It is the analysis of a sequence of characters in order to produce symbols (tokens) that make it easier to understand what is written there.

      

    Lexical Analysis and Parsing / Parser are the same things, or are they really different things?

    They are complementary.

    The first step is the generation of tokens, called lexical analysis, whereby the input character stream is divided into meaningful symbols defined by a grammar of regular expressions.

    The second step is syntactic analysis, which is to verify that tokens form an allowed expression.

    The final phase is semantic analysis, which elaborates the implications of the validated expression and takes appropriate action.
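
    A minimal sketch of these three steps, in JavaScript (the grammar and names below are invented for this illustration), for a tiny assignment language:

    // 1) lexical analysis: characters -> tokens
    function lex(text) {
        var tokens = [], m;
        while ((text = text.replace(/^\s+/, '')) !== '') {
            if ((m = text.match(/^[a-zA-Z]+/)))   tokens.push({ name: 'ID', value: m[0] });
            else if ((m = text.match(/^[0-9]+/))) tokens.push({ name: 'NUMBER', value: Number(m[0]) });
            else if ((m = text.match(/^=/)))      tokens.push({ name: 'EQUALS' });
            else throw 'Lexical error at: ' + text[0];
            text = text.substring(m[0].length);
        }
        return tokens;
    }

    // 2) syntactic analysis: do the tokens form the allowed expression  assignment -> ID '=' NUMBER ?
    function parse(tokens) {
        if (tokens.length !== 3 || tokens[0].name !== 'ID' ||
            tokens[1].name !== 'EQUALS' || tokens[2].name !== 'NUMBER') {
            throw 'Syntax error';
        }
        return { variable: tokens[0].value, value: tokens[2].value };
    }

    // 3) semantic analysis: give the validated expression its meaning,
    //    here by storing the variable in an environment.
    var environment = {};
    var assignment = parse(lex('answer = 42'));
    environment[assignment.variable] = assignment.value; // environment is now { answer: 42 }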

        
    19.02.2016 / 12:32
    8

    Lexical analysis and syntactic analysis (parsing) are in fact very similar things, since the algorithms that work on these two fronts operate in a similar way: the input processed by both is similar, and so are the results they produce.

    As mentioned in the comments of the question, this expression indeed comes up a lot when studying automata, precisely because lexical analyzers deal only with regular languages. In practice, a lexical analyzer acts as a token recognizer. That is, given the grammar of any regular language, a lexical analyzer is able to determine whether a given string is part of that language or not. Hence the famous regular expressions.

    In contrast, parsers act on a more complex level of languages, because they deal with context-free languages. In practice, parsers usually process a sequence of tokens (generally recognized by lexers) and determine whether those tokens satisfy the structures defined in the grammar of the language.

    Therefore, lexical analysis consists of validating tokens according to the formation rules of a regular language; if the language is no longer regular, then it is a job for a parser. In the case of Twig, since it is a template engine, I believe the lexical analysis happens in the recognition of the special markers such as {{ , else , {% , etc.
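
    Just to illustrate that idea (this is a hypothetical sketch and not Twig's actual implementation), recognizing such markers can be done with regular expressions alone:

    // Hypothetical lexer for Twig-like markers; each pattern is a regular expression.
    var TEMPLATE_TOKENS = [
        { name: 'VAR_START',   regex: /^\{\{/ },
        { name: 'VAR_END',     regex: /^\}\}/ },
        { name: 'BLOCK_START', regex: /^\{%/ },
        { name: 'BLOCK_END',   regex: /^%\}/ },
        { name: 'TEXT',        regex: /^[^{%}]+/ }
    ];

    function tokenize(template) {
        var out = [];
        while (template !== '') {
            var matched = false;
            for (var i = 0; i < TEMPLATE_TOKENS.length; i++) {
                var m = template.match(TEMPLATE_TOKENS[i].regex);
                if (m) {
                    out.push([TEMPLATE_TOKENS[i].name, m[0]]);
                    template = template.substring(m[0].length);
                    matched = true;
                    break;
                }
            }
            if (!matched) { // a lone '{', '}' or '%' is treated as plain text
                out.push(['TEXT', template[0]]);
                template = template.substring(1);
            }
        }
        return out;
    }

    tokenize('{% if user %}Hello {{ name }}!{% endif %}');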

    I am updating my answer because I believe it was not broad enough, and also because I think the other answers were either too general or too focused on parsers.

    First the fundamental similarities between lexers and parsers:

  • Both take as input symbols from some alphabet. Generally the symbols of a lexer are ASCII or Unicode characters, whereas parsers process tokens (the terminal symbols of the grammar they are validating).

  • They analyze these symbols by matching them against a grammar in order to recognize them as members of a language. As I explained above, lexers only validate regular languages, whereas parsers operate on context-free languages: levels 3 and 2 of the Chomsky hierarchy, respectively.

  • Both produce as output sentences of the language they are evaluating. Lexers distinguish tokens within a string given as input, whereas parsers generate syntax trees.

    Regarding the last point, although the two are complementary in the vast majority of cases, this does not mean that a parser will always receive its tokenized input from a lexer. Since a parser is capable of generating multiple sentences of some language L1, we can obtain a language L2 whose tokens are L1 sentences. Therefore, parsers can also act as tokenizers for other parsers.

    In natural language processing, parsers get their tokenized input from a series of steps that may involve manual editing, statistical analysis, and machine learning.

    So the fundamental difference between the two is what I explained above: the level of language on which they operate and, consequently, the complexity required to operate at that level. While finite-state automata are sufficient for regular languages, context-free languages require more complex mechanisms such as pushdown automata, along with the whole wide range of existing parsers (bottom-up, top-down, tabular, etc.).
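
    A small illustration of that difference in power: a single regular expression can recognize a flat lexeme, but checking arbitrarily nested structures requires some form of memory (a counter or stack), which is exactly what parsers add:

    // Regular level: one regular expression is enough to recognize an identifier.
    /^[a-zA-Z_][a-zA-Z0-9_]*$/.test('variableName'); // true

    // Context-free level: correctly nested braces need a counter/stack
    // (the role played by pushdown automata and parsers).
    function bracesBalanced(text) {
        var depth = 0;
        for (var i = 0; i < text.length; i++) {
            if (text[i] === '{') depth++;
            if (text[i] === '}') depth--;
            if (depth < 0) return false; // closed a block that was never opened
        }
        return depth === 0;
    }

    bracesBalanced('h1 { } div { p { } }'); // true
    bracesBalanced('h1 { } }');             // false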

        
    18.02.2016 / 21:25
    5

    An interpreter / compiler does not understand raw text directly; it needs the data to be well organized and defined.

    For example, let's assume that our interpreter needs to interpret this simple sum:

    42 + 42
    

    How do we solve this? This is where the role of the lexical analyzer comes in.

    The programmer who created our imaginary interpreter defined that a number is one digit followed by more digits, and that a sum is simply the '+' symbol.

    [0-9]+ returns NUMBER;
    "+"    returns PLUS;
    " "    ;
    

    What now? Well, let's look at what happens as it scans the input; the dot marks the character currently being analyzed:

    1) . 42 (Our lexer stores the digit 4 in its buffer; it knows that, by definition, a number is one digit followed by more digits.)

    2) 4 . 2 (So far it has 4 as a number, and it continues by storing 2 in the buffer.)

    3) 42 . (This is the end of the number, so our interpreter returns NUMBER. Then it finds a blank space, which by definition returns nothing.)

    For now, we know this:

    <NUMBER, 42> . + 42
    

    The lexer is now on the '+'; it knows that, by definition, it is a PLUS:

    <NUMBER, 42> <PLUS, '+'> . 42
    

    And it's the same process on the 42:

    <NUMBER, 42> <PLUS, '+'> <NUMBER, 42>
    

    The analysis has been completed, now what? Our interpreter can interpret this data consistently. Let's suppose that our interpreter uses a very simple and restrictive grammar for sums:

    sum -> NUMBER PLUS NUMBER
    

    It can use the tokens output by the lexical analyzer and focus only on parsing. Since the output consists of NUMBER PLUS NUMBER, it matches the rule and is a valid sum.
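
    A minimal JavaScript sketch of this whole walk-through (hypothetical code; the imaginary interpreter above is not a real library):

    // Lexer: turns "42 + 42" into <NUMBER, 42> <PLUS, '+'> <NUMBER, 42>
    function lex(input) {
        var tokens = [], m;
        while (input !== '') {
            if ((m = input.match(/^[0-9]+/)))  tokens.push(['NUMBER', m[0]]);
            else if ((m = input.match(/^\+/))) tokens.push(['PLUS', m[0]]);
            else if ((m = input.match(/^ +/))) { /* blank space returns nothing */ }
            else throw 'Unexpected character: ' + input[0];
            input = input.substring(m[0].length);
        }
        return tokens;
    }

    // Parser: accepts only the grammar  sum -> NUMBER PLUS NUMBER
    function isValidSum(tokens) {
        return tokens.length === 3 &&
               tokens[0][0] === 'NUMBER' &&
               tokens[1][0] === 'PLUS' &&
               tokens[2][0] === 'NUMBER';
    }

    isValidSum(lex('42 + 42')); // true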

        
    24.02.2016 / 18:02
    4

    The answers are very good and have covered the whole technical side of the subject. So, just to complement the information, here is an explanation focused only on the etymology of the words.

    Lexical Analysis

    Refers to the vocabulary (words) of the language. In simple terms, it is a dictionary analysis: it verifies the existence of the terms / vocabulary within the language.
    For example, "carnaval", "ovo" and "bola" are part of the lexicon of the Portuguese language, while "carnival", "egg" and "ball" are part of the lexicon of the English language. Lexical analysis does not concern itself with the order or meaning of the terms, but only with the terms themselves.

      

    You can see a technical example here.

    Syntactic Analysis

    Refers to the grammatical rules of the language , that is, how we can organize the terms of the language to create meaning. It can be viewed as the way a command should be structured to perform an action, or the rules for forming a sentence.

      

    Technical example here .

    Semantic Analysis

    Here we are talking about the meaning / sense employed in the phrase / command. For example, we can use the phrase Me inclui fora dessa ("include me out of this"), where the words and the syntax are correct, but the semantics are not.

      

    Technical example here .

    These definitions are part of linguistics as a whole and are used as a way to organize and facilitate the understanding of how a code / process proposes to work.

    I chose not to use a technical approach and to give examples from the Portuguese language, because they are equally valid when you think of programming languages, and they make the terms simpler to understand.

        
    22.02.2016 / 18:45
    4

    Everything has already been said; however, I like language processors and cannot resist putting here an example, in this case using Lex + Yacc.

    Problem statement: given a (simplified) CSS file, taking as an example the case presented by @BrunoRP, calculate how many properties each tag has.

    Translator grammar: grammar + semantic actions

    Syntactic analyzer = parser (yacc)

    %{
    #include <stdio.h>
    %}
    
    %union {char *s; int i;}
    %token <s> ID STR MEA
    %type  <i> props
    
    %%
    css  :                                      // bloco*  (zero or more blocks)
         | css bloco
         ;
    bloco: ID '{' props '}'  { printf("%s - %d\n",$1,$3);}
         ;
    props:                   { $$ = 0    ;}     // prop*  (zero or more properties)
         | props prop        { $$ = $1+1 ;}     // count the props
         ;
    prop : ID ':' v ';'      ;                  // property
    v    : ID | STR | MEA    ;                  // value
    %%
    #include "lex.yy.c"
    
    yyerror(char *s){fprintf(stderr,"ERROR (line %d):'%s'\n-%s",yylineno, yytext,s);}
    
    int main() { yyparse(); return 0; }
    

    Lexical analyzer (flex)

    %option yylineno
    %option noyywrap
    
    %%
    [ \n\t]+              {  } // ignore whitespace
    #.*                   {  } // ignore comments
    
    [{}:;]                { return yytext[0];} // special characters
    
    [a-zA-Z][-a-zA-Z0-9]* { yylval.s=strdup(yytext);  return ID;  }
    \"[^\"]*\"            { yylval.s=strdup(yytext);  return STR; }
    [0-9]+(px|pt|mm)      { yylval.s=strdup(yytext);  return MEA; }
    
    .  {fprintf(stderr,"%d: Error - invalid character '%c'\n",yylineno,yytext[0]);}
    
    %%
    

    What the lexical analysis does:

  • skip the spaces and comments
  • return reserved word codes and special characters
  • group the characters that form IDs, etc.; return their code and value
  • deal with lexical errors
    Compiling and testing:

    $ flex proc.l
    $ yacc proc.y
    $ cc -o proc y.tab.c
    $ proc < ex.css      ## ex.css is @BrunoRP's example CSS file
    
    h1 - 2
    body - 1
    div - 2
    
        
    25.02.2016 / 23:39