Definition
Lexical analysis is the process performed on a text, say a computer program or a markup language such as HTML, that divides it into lexemes and converts them into a sequence of tokens, which are used to feed a parser. A program that performs lexical analysis is usually called a lexer, scanner, or tokenizer.
tokens / lexemes
Lexemes are the relevant syntactic units in the lexer's context; for example, for a lexer targeting a certain programming language, some lexemes could be: 1, "hello", for, ==, variableName, function.
A token is a structure that categorizes a lexeme: it contains an abstract name that represents the lexeme's group and, when the group is not unique, a possible value. Some example tokens (each token is represented in the format < name, value >, and "#" marks the start of a comment; a sketch of this representation in code follows the list):
< digit, 1 > # any numeric value
< literal, "hello" > # values enclosed in double quotes
< for > # the characters f,o,r
< comparison, == > # the symbols <, >, <=, >=, == and !=
< id, variableName > # a sequence of letters that represents a variable
< function > # the letters f,u,n,c,t,i,o,n
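To make this concrete, here is a minimal sketch (my own illustration, not a standard representation) of how these tokens could be modeled in JavaScript:

// a hypothetical token representation: a name that identifies
// the group, plus an optional value when the group is not unique
var exampleTokens = [
    {name: 'digit', value: '1'},
    {name: 'literal', value: '"hello"'},
    {name: 'for'},                       // no value: the group is unique
    {name: 'comparison', value: '=='},
    {name: 'id', value: 'variableName'},
    {name: 'function'}
];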
difference from parsing
Lexers and parsers are closely linked, yet they are distinct concepts. The lexer specializes in extracting the relevant portions of the text and transforming them into "meaningful" structures, in this case the tokens. The parser has the job of analyzing the syntactic structure of a text, for example to say whether, in a given programming language, the expression "hello" 1 == "anotherliteral" is syntactically valid or invalid. The connection between the two is that the structures produced by the lexer are what the parser uses as its source to perform the parsing; that is, the parser works with tokens (see the sketch after the list below). This is not required; nothing prevents you from building a parser that parses raw text directly, but separating the two tasks has some advantages:
Design simplification. A parser that also performs the work of a lexer is significantly more complex.
Optimization. By separating the two tasks (lexing and parsing), you have more freedom to apply optimization techniques specific to each task.
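As an illustration of this relationship, here is a minimal sketch (the grammar and names are invented for this example) of a parser that works over tokens rather than raw text, rejecting the invalid expression mentioned above:

// hypothetical token stream a lexer could produce for: "hello" 1 == "anotherliteral"
var invalid = [
    {name: 'literal', value: '"hello"'},
    {name: 'digit', value: '1'},
    {name: 'comparison', value: '=='},
    {name: 'literal', value: '"anotherliteral"'}
];

// a toy parser: a comparison is valid only if the stream is exactly
// < literal > < comparison > < literal >
function isValidComparison(tokens) {
    return tokens.length === 3 &&
        tokens[0].name === 'literal' &&
        tokens[1].name === 'comparison' &&
        tokens[2].name === 'literal';
}

console.log(isValidComparison(invalid)); // false: syntactically invalid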
a lexer in practice
The theory is essential, yet nothing beats real code to aid understanding. Here I show a simple JavaScript lexer that handles a subset of CSS; more specifically, it handles this example:
h1 {
    color: red;
    font-size: 30px;
}

body {
    background-color: yellow;
}

div {
    margin: 10px;
    padding: 10px;
}
The code can be executed and will show the list of generated tokens after processing our target CSS:
// our CSS
var css = "h1 { \n\
    color: red; \n\
    font-size: 30px; \n\
} \n\
\n\
body { \n\
    background-color: yellow; \n\
} \n\
\n\
div { \n\
    margin: 10px; \n\
    padding: 10px; \n\
} \n\
";
/**
 * List that defines our tokens and the respective regular expressions that find them in the text.
 */
var TOKENS = [
    {name: 'EMPTY', regex: /^(\s+)/ },
    {name: 'RULE_SET_START', regex: /^({)/ },
    {name: 'RULE_SET_END', regex: /^(})/ },
    {name: 'RULE_DEFINITION', regex: /^(:)/ },
    {name: 'RULE_END', regex: /^(;)/ },
    {name: 'SELECTOR', regex: /^([a-zA-Z]+[a-zA-Z0-9]*)(?=\s*{)/ },
    {name: 'RULE', regex: /^([a-z][-a-z]+)(?=\s*:)/ },
    {name: 'RULE_VALUE', regex: /^(\w+)(?=\s*(;|}))/ }
];
var text = css;
var outputTokenList = [];
while (text !== '') { // while there is text left to consume
    var hasMatch = false;
    /**
     * We iterate over the whole TOKENS list until we find one whose pattern matches the beginning of our text.
     * When a match occurs we add the lexeme, with its respective token, to the lexer's output token list.
     * If no pattern matches the text, an exception is thrown printing the line that contains the error.
     */
    for (var i = 0; i < TOKENS.length; i++) {
        var obj = TOKENS[i];
        var tokenName = obj.name;
        var tokenRegex = obj.regex;
        var matched = text.match(tokenRegex);
        if (!matched) {
            continue;
        }
        hasMatch = true;
        var lexeme = matched[1];
        // we remove the matched lexeme from the text
        // so that another one can be considered
        // in the next iteration
        text = text.substring(lexeme.length);
        if (tokenName in {'SELECTOR': 1, 'RULE': 1, 'RULE_VALUE': 1}) {
            // we add both the token name and the lexeme
            outputTokenList.push([tokenName, lexeme]);
        }
        else if (tokenName in {'EMPTY': 1}) {
            // discard; we don't care about whitespace and line breaks.
        }
        else {
            // in these cases the relationship between the token name
            // and the lexeme is 1<->1, so we don't need to store the lexeme.
            outputTokenList.push([tokenName]);
        }
        break;
    }
    if (!hasMatch) {
        throw 'Invalid pattern at:\n' + (text.substring(0, text.indexOf('\n')));
    }
}
outputTokenList.forEach(function(token) {
    document.write('< ' + token[0]);
    if (token.length > 1) {
        document.write(' , ' + token[1]);
    }
    document.write(' ><br>');
});
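For reference, when run in a browser the tokens written for the first rule set (h1) should look like this; the remaining rule sets follow the same pattern:

< SELECTOR , h1 >
< RULE_SET_START >
< RULE , color >
< RULE_DEFINITION >
< RULE_VALUE , red >
< RULE_END >
< RULE , font-size >
< RULE_DEFINITION >
< RULE_VALUE , 30px >
< RULE_END >
< RULE_SET_END >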
I am not going to go into detail about lexer implementations, because that is an extensive subject not directly related to the question. Note that this is just an illustration of how a lexer can operate, reading a target text and generating tokens as output; the code is neither efficient nor robust.