Recognize word repetitions in String

Question


17

I have text in a StringBuffer and I need to check and mark the words that appear more than once. At first I used a circular queue with 10 positions, because I am only interested in words repeated within a "radius" of 10 words. Note that repeated words should only be marked if the occurrences are within 10 words of each other; if they are more than 10 words apart, they should not be marked. The Contem method returns "null" if there is no repetition, or the repeated word otherwise. string is just the variable that holds the full text.

StringBuffer stringProximas = new StringBuffer();
String has = "";
Pattern pR = Pattern.compile("[a-zA-Zà-úÀ-Ú]+");
Matcher mR = pR.matcher(string);
while(mR.find()){
  word = mR.group();
  nextWord.Inserir(word);//insert into the list
  has = nextWord.Contem();//check whether the list contains equal words
  //an if to check whether has is null or not
  //and here the repeated word is marked, if has is not null
  mR.appendReplacement(stringProximas, "");
  stringProximas.append(has);
}
public void Inserir(String palavra){
    if(this.list[9].equals("null")){
        if(this.list[0].equals("null")){
            this.list[this.fim]=palavra;
        }else{
            this.fim++;
            this.list[this.fim] = palavra;
        }
    }else{
        //wrap the fim pointer around to position 0
        if(this.inicio == 0 && this.fim == 9){
            this.inicio++;
            this.fim = 0;
            this.list[this.fim] = palavra;
        }else if(this.inicio == 9 && this.fim == 8){//wrap the inicio pointer around to position 0
            this.inicio = 0;
            this.fim++;
            this.list[this.fim] = palavra;
        }else{
            this.inicio++;
            this.fim++;
            this.list[this.fim] = palavra;                    
        }
    }
}
public String Contem() throws Exception{
    for(int i=0;i<this.list.length;i++){
        for(int j=i+1;j<this.list.length;j++){
            if(this.list[i].equals(this.list[j]) && (!this.list[i].equals("null") || !this.list[j].equals("null"))){
                //do not pick up the same repetition more than once
                if(!this.list[i].equals("?")){
                    this.list[i] = "?";//this will probably be removed
                    return this.list[j];
                }
            }
        }
    }
    return "null";
}

My big problem: when I find a repeated word, I can only mark the second occurrence. Even though the first occurrence is still in the queue, the variable word holds the second one, and because of the while loop I cannot go back and mark the first.

I'm using this text as an example:
Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.
The method should return, for example (I put the repeated words in bold in the original post, shown here in brackets, but they do not have to be marked this way):
Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado.

java algoritmo stringbuffer

asked by anonymous 06.02.2015 / 20:19

8 answers

15

Solution:

Using regular expressions you can solve this with small, very expressive code with few ifs: in fact, only one if and only one loop:

public String assinalaRepetidas(String texto, String marcadorInicio, 
                                            String marcadorFim, int qtdPalavrasAnalisar) {

    String palavraInteiraPattern = "\\p{L}+"; 
    Pattern p = Pattern.compile(palavraInteiraPattern);
    Matcher matcher = p.matcher(texto);

    ArrayList<String> palavras = new ArrayList<String>();
    ArrayList<String> palavrasRepetidas = new ArrayList<String>();

    while (matcher.find() && palavras.size() < qtdPalavrasAnalisar) {

        String palavra = matcher.group();

        if (palavras.contains(palavra) && !palavrasRepetidas.contains(palavra)) {
            texto = texto.replaceAll(
                    String.format("\\b%s\\b", palavra), 
                    String.format("%s%s%s", marcadorInicio, palavra, marcadorFim));

            palavrasRepetidas.add(palavra);
        }
        palavras.add(palavra);
    }
    return texto;
}

And that's it!

Below, some explanation and also the consumer code.

Explaining the solution:

I used a regular expression to get every word in the text, ignoring spaces, parentheses, symbols, commas and other punctuation that are not real words. The regular expression that does this in Java for accented (Unicode) text is \p{L}+ .

In the same loop that collects the words found by the regular expression, I replace each word that has repeated, wrapping it in the markers.
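As a rough illustration of the word-extraction step described above, here is a minimal sketch. The class name ExtraiPalavras and the method palavras are hypothetical names chosen for this example; they are not part of the answer's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, only to show what \p{L}+ picks out of a text.
class ExtraiPalavras {
    // \p{L} matches any Unicode letter, so accented words are kept whole.
    private static final Pattern PALAVRA = Pattern.compile("\\p{L}+");

    static List<String> palavras(String texto) {
        List<String> resultado = new ArrayList<>();
        Matcher m = PALAVRA.matcher(texto);
        while (m.find()) {
            resultado.add(m.group()); // one maximal run of letters per match
        }
        return resultado;
    }

    public static void main(String[] args) {
        System.out.println(palavras("Hoje em dia, é necessário ser esperto."));
        // prints [Hoje, em, dia, é, necessário, ser, esperto]
    }
}
```

Punctuation and spaces never show up in the list because they are simply not part of any match.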

The consumer code (unit test) looks like this:

@Test
public void assinalaPrimeirasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], é necessário ser esperto. O nosso [dia] a [dia] é complicado.";

  assertEquals(esperado, new AnalisaTexto().assinalaRepetidas(texto, "[", "]", 10));
}

Although the question describes wanting only the first 10 words, the expected result in its example seems to consider all of them. So I added an overload that does without the limit ("radius") on the words to analyze:

public String assinalaPalavrasRepetidas(String texto, String marcadorInicio, String marcadorFim) {
    return assinalaRepetidas(texto, marcadorInicio, marcadorFim, Integer.MAX_VALUE);
}

Using this other method, since more than 10 words are analyzed, "é" is also identified as repeated:

@Test
public void assinalaTodasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado.";

  assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}

Finally, note that I also used a regular expression when replacing the words with their marked equivalents: see the regex passed to texto.replaceAll . Without it, part of another word that contains the match would also be marked. For example, the "ser" inside "servir" would be flagged.

The test that proves the effectiveness of this little care is:

@Test
public void assinalaApenasPalavraInteira() {

    String texto = "Hoje em dia, pode ser necessário servir ao ser esperto.";
    String esperado = "Hoje em dia, pode [ser] necessário servir ao [ser] esperto.";

    assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}
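The difference between a plain substring replacement and a whole-word replacement can be seen in isolation with the sketch below. LimiteDePalavra and its two methods are hypothetical names for this example. The sample text is deliberately ASCII-only, because how \b treats accented letters may depend on the Java version and on the UNICODE_CHARACTER_CLASS flag.

```java
// Hypothetical demo class: contrasts String.replace with a \b-anchored replaceAll.
class LimiteDePalavra {
    static String semLimite(String texto) {
        // Plain substring replacement: also hits the "ser" inside "servir".
        return texto.replace("ser", "[ser]");
    }

    static String comLimite(String texto) {
        // \b anchors the match at word boundaries, so only the whole word is replaced.
        return texto.replaceAll("\\bser\\b", "[ser]");
    }

    public static void main(String[] args) {
        String texto = "pode ser bom servir ao ser esperto";
        System.out.println(semLimite(texto)); // pode [ser] bom [ser]vir ao [ser] esperto
        System.out.println(comLimite(texto)); // pode [ser] bom servir ao [ser] esperto
    }
}
```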

12.02.2015 / 19:18

9

The code is self-explanatory. Basically, what I did was:

Create a list of the repeated words;

Go through this list and look up each of its words in the original text;

Replace the word with itself, wrapped in the markup.

I marked up the text with <b></b> around the repeated words.

Once you have the list of duplicates the rest is easy, because the replace() function does almost all the work: it finds the words in the original text and swaps in the marked version.

main()

Ideone ide = new Ideone();
StringBuffer texto = new StringBuffer("Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.");
List<String> palavrasRepetidas = ide.pegaRepetidas(texto);

String saida = texto.toString().replace(palavrasRepetidas.get(0), "<b>"+palavrasRepetidas.get(0)+"</b>");

for (int i=1; i < palavrasRepetidas.size(); i++) {
  saida = saida.replace(palavrasRepetidas.get(i), "<b>"+palavrasRepetidas.get(i)+"</b>");
}
System.out.println(saida);

pegaRepetidas()

/** Returns a list with the words that appear more than once in the text */
private static List<String> pegaRepetidas(StringBuffer texto) {
    String textoFormatado = texto.toString().replaceAll("[,.!]", ""); // Strips commas, periods and exclamation marks
    StringTokenizer st = new StringTokenizer(textoFormatado);

    List<String> palavrasRepetidas = new ArrayList<>();

    while (st.hasMoreTokens()) {
        String palavra = st.nextToken();
        if (contaPalavra(palavra, textoFormatado) > 1) { // If the word appears more than once
            if ( !palavrasRepetidas.contains(palavra) ) { // If it is not yet in the list of repeated words
                palavrasRepetidas.add(palavra);
            }
        }
    }

    return palavrasRepetidas;
}

contaPalavra()

/** Returns the number of times 'palavra' appears in 'texto' */
private static int contaPalavra(String palavra, String texto) {
    StringTokenizer st = new StringTokenizer(texto);
    int count = 0;
    while (st.hasMoreTokens()) {
        if (st.nextToken().compareTo(palavra) == 0) {
            count++;
        }
    }

    return count;
}

See working at ideone: link

10.02.2015 / 12:04

9

I decided to reimplement it from scratch. The reasons:

Not relying on custom list implementations;
Not depending on Strings with special, arbitrary values;
Not needing hardcoded numbers or complicated arithmetic;
Not needing regular expressions where a simpler approach does the job;
Letting Java itself decide what is or is not a letter, according to its implementation of the Unicode standard.

And here's the result. Explanatory comments in the code:

import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * @author Victor
 */
public class BuscaPalavras {

    private static final Locale PT_BR = new Locale("PT", "BR");

    public static Set<String> palavrasRepetidas(int raio, String texto) {
        // Repeated words already found within the radius. Uses a Set to eliminate duplicates automatically.
        Set<String> palavrasRepetidas = new LinkedHashSet<>(50);

        // List containing the last words found. The maximum size of the list equals the radius.
        List<String> ultimasPalavras = new LinkedList<>();

        // Holds the word being built as the characters are read.
        StringBuilder palavra = null;

        // Converts the whole text into an array.
        char[] caracteres = texto.toCharArray();
        int tamanho = caracteres.length;

        // Iterates over each position of the array, up to one past the last (important).
        for (int i = 0; i <= tamanho; i++) {
            // If character i of the text is a letter, appends it to the StringBuilder.
            // At the position past the last one, it skips this if and falls through to the else-if.
            if (i < tamanho && Character.isLetter(caracteres[i])) {
                // Creates the StringBuilder if the word is just starting.
                if (palavra == null) palavra = new StringBuilder(20);
                palavra.append(caracteres[i]);

            // Otherwise, if a word has just ended...
            } else if (palavra != null) {
                // Takes it out of the StringBuilder and converts it to upper case.
                String novaPalavra = palavra.toString().toUpperCase(PT_BR);

                // If it is among the last words within the radius, adds it to the set of repeated words.
                if (ultimasPalavras.contains(novaPalavra)) palavrasRepetidas.add(novaPalavra);

                // Slides the list of last words forward.
                if (ultimasPalavras.size() >= raio) ultimasPalavras.remove(0);
                ultimasPalavras.add(novaPalavra);

                // The word is finished. Resets to null so that another word can start later.
                palavra = null;
            }
        }
        return palavrasRepetidas;
    }

    // To test the palavrasRepetidas method.
    public static void main(String[] args) {
        String texto = "O rato roeu a roupa do rei de Roma e a rainha roeu o resto."
                + " Quem mafagafar os mafagafinhos bom amafagafigador será."
                + " Será só imaginação? Será que nada vai acontecer? Será que é tudo isso em vão?"
                + " Será que vamos conseguir vencer? Ô ô ô ô ô ô, YEAH!"
                + " O pato faz Quack-quack!"
                + " Quem é que para o time do Pará?";

        System.out.println(palavrasRepetidas(10, texto));
    }
}

Output of the main method:

[A, ROEU, SERÁ, QUE, Ô, QUACK, O]
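The same sliding-window idea can be sketched more compactly with an ArrayDeque. This is only an illustrative variant, not the code of this answer: it swaps the character loop for the \p{L}+ regex, uses Locale.ROOT instead of a Brazilian locale, and the names JanelaDeslizante and repetidasNoRaio are made up for the example. It assumes a radius of at least 1.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant: a deque holds the last `raio` words seen.
class JanelaDeslizante {
    static Set<String> repetidasNoRaio(int raio, String texto) {
        if (raio < 1) throw new IllegalArgumentException("raio must be at least 1");
        Set<String> repetidas = new LinkedHashSet<>();
        Deque<String> janela = new ArrayDeque<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(texto);
        while (m.find()) {
            String palavra = m.group().toUpperCase(Locale.ROOT);
            // The word repeats if it is still inside the window.
            if (janela.contains(palavra)) repetidas.add(palavra);
            // Drop the oldest word once the window is full, then add the new one.
            if (janela.size() >= raio) janela.removeFirst();
            janela.addLast(palavra);
        }
        return repetidas;
    }

    public static void main(String[] args) {
        String texto = "O rato roeu a roupa do rei de Roma e a rainha roeu o resto.";
        System.out.println(repetidasNoRaio(10, texto)); // prints [A, ROEU]
    }
}
```

In this sentence the second "o" is not reported because the first "O" has already left the 10-word window by the time it is read.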

06.02.2015 / 23:17

7

I'm not very good at Java, so corrections to this code are welcome, but I think I got the result you want:

See it running here in Ideone:

link

import java.util.*;
import java.lang.*;
import java.io.*;

enum TokenKind
{
    WordSeparator,
    Word
}

class Token
{
    int _start;
    int _end;
    String _text;
    TokenKind _kind;

    public String getText()
    {
        return _text;
    }

    public void setText(String value)
    {
        _text = value;
    }

    public void setStart(int value)
    {
        _start = value;
    }

    public void setEnd(int value)
    {
        _end = value;
    }

    public int getStart()
    {
        return _start;
    }

    public int getEnd()
    {
        return _end;
    }

    public TokenKind getKind()
    {
        return _kind;
    }

    public void setKind(TokenKind value)
    {
        _kind = value;
    }
}

class LinearRepeatSearchLexer
{
    StringBuffer _text;
    int _position;
    char _peek;

    public LinearRepeatSearchLexer(StringBuffer text)
    {
        _text = text;
        _position = 0;
        _peek = (char)0;
    }

    public Token nextToken()
    {
        Token ret = new Token();
        char peek = PeekChar();

        if(isWordSeparator(peek))
        {
            ret.setStart(_position);
            readWordSeparator(ret);
            ret.setEnd(_position - 1);
            return ret;
        }
        else if(isLetterOrDigit(peek))
        {
            ret.setStart(_position);
            readWord(ret);
            ret.setEnd(_position - 1);
            return ret;
        } 
        else if(peek == (char)0)
        {
            return null;
        }
        else
        {
            // TODO: 
            //  unidentified characters
            //  or you could simplify readWord
            return null;
        }
    }

    void readWordSeparator(Token token)
    {
        char c = (char)0;
        StringBuffer tokenText = new StringBuffer();
        while(isWordSeparator(c = PeekChar()))
        {
            tokenText.append(c);
            MoveNext(1);
        }
        token.setText(tokenText.toString());
        token.setKind(TokenKind.WordSeparator);
    }

    void readWord(Token token)
    {
        char c = (char)0;
        StringBuffer tokenText = new StringBuffer();
        while(isLetterOrDigit(c = PeekChar()))
        {
            tokenText.append(c);
            MoveNext(1);
        }
        token.setText(tokenText.toString());
        token.setKind(TokenKind.Word);
    }

    boolean isWordSeparator(char c)
    {
        // TODO: other separators here
        return c == ' ' ||
            c == '\t' ||
            c == '\n' ||
            c == '\r' ||
            c == ',' ||
            c == '.' || 
            c == '-' || 
            c == ';' || 
            c == ':' ||
            c == '=' ||
            c == '>';
    }

    boolean isLetterOrDigit(char c)
    {
        // TODO: other letters here
        return (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            (c >= 'à' && c <= 'ú') ||
            (c >= 'À' && c <= 'Ú') ||
            c == '_';
    }

    char PeekChar()
    {
        if(_position < _text.length())
            return _text.charAt(_position);
        return (char)0;
    }

    void MoveNext(int plus)
    {
        _position += plus;
    }
}

class LinearRepeatSearch
{
    StringBuffer _text;
    int _radius;

    public LinearRepeatSearch(StringBuffer text, int radius)
    {
        _text = text;
        _radius = radius;
    }

    public LinearRepeatSearch(String text, int radius)
    {
        this(new StringBuffer(text), radius);   
    }

    public List<Token> getRepeatedWords()
    {
        // read all the tokens
        ArrayList<Token> ret = new ArrayList<Token>();
        ArrayList<Token> readed = new ArrayList<Token>();
        LinearRepeatSearchLexer lexer = new LinearRepeatSearchLexer(_text);
        Token token = null;
        while((token = lexer.nextToken()) != null)
        {
            if(token.getKind() == TokenKind.Word)
                readed.add(token);
        }

        // find repetitions within the radius
        // PERF:
        //      the performance of this loop can be improved,
        //      since it repeats comparisons
        int size = readed.size();
        for(int x = 0; x < size; x++)
        {
            Token a = readed.get(x);
            for(int y = Math.max(0, x - _radius); y < size && (y - x) < _radius; y++)
            {
                if(x == y) continue;
                Token b = readed.get(y);
                if(a.getText().equals(b.getText()))
                {
                    ret.add(a);
                    break;
                }
            }
        }

        return ret;
    }
}

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // your code goes here

        StringBuffer input = new StringBuffer("Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.");
        StringBuffer output = new StringBuffer();
        LinearRepeatSearch searcher = new LinearRepeatSearch(input, 10);
        List<Token> spans = searcher.getRepeatedWords();
        int listSize = spans.size();
        int position = 0;
        for(int x = 0; x < listSize; x++)
        {
            Token item = spans.get(x);
            output.append(input.substring(position, item.getStart()));
            output.append("<b>");
            output.append(item.getText());
            output.append("</b>");
            position = item.getEnd() + 1;
        }
        if(position < input.length())
        {
            output.append(input.substring(position));
        }
        System.out.println(output.toString());
    }
}

Output of the code:

Hoje em <b>dia</b>, <b>é</b> necessário ser esperto. O nosso <b>dia</b> a <b>dia</b> <b>é</b> complicado.

10.02.2015 / 14:12

6

A fairly simple solution *, using split and two sets:

public static String marcarRepetidas(String s, String prefixo, String sufixo) {
    Set<String> palavras = new HashSet<String>();
    Set<String> palavrasRepetidas = new HashSet<String>();

    // Finds the set of repeated words
    for ( String palavra : s.split("[^a-zA-Zà-úÀ-Ú]+") ) {
        palavra = palavra.toLowerCase();
        if ( palavra.length() > 0 && palavras.contains(palavra) )
            palavrasRepetidas.add(palavra);
        palavras.add(palavra);
    }

    // Marks each of those words in the text (wrapping them in a prefix and a suffix)
    for ( String palavra : palavrasRepetidas )
        s = s.replaceAll("(?<![a-zA-Zà-úÀ-Ú])(?iu)(" + palavra + ")(?![a-zA-Zà-úÀ-Ú])",
                         prefixo + "$1" + sufixo);

    // In Java 8 it is simpler (a single replaceAll call):
    // String juncao = String.join("|", palavrasRepetidas);
    // s = s.replaceAll("(?<![a-zA-Zà-úÀ-Ú])(?iu)(" + juncao + ")(?![a-zA-Zà-úÀ-Ú])",
    //                  prefixo + "$1" + sufixo);

    return s;
}

Example on Ideone . The replaceAll at the end deserves an explanation: before replacing a word in the text, it is important to make sure that it really is a whole word and not a substring of another word. For this I used two negative lookarounds, one to check that it is not preceded by a letter and another to check that it is not followed by one. The (?iu) makes the match case-insensitive, and the capture group lets the word be replaced by its marked version without changing its capitalization. Example:

Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado. Zé. diafragma. Dia. É. Dia.

Output (with "[" and "]" standing in for the prefix and suffix):

Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado. Zé. diafragma. [Dia]. [É]. [Dia].

* This answer aims at simplicity, not efficiency; a "manual" method (i.e. one where each costly API call, such as the regexes, is replaced by an explicit and then optimized loop, taking advantage of StringBuilder , etc.) could perform better, if that requirement matters in your particular case.
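The lookaround technique explained above can be tried on its own with the sketch below. It is a variation under stated assumptions rather than this answer's exact code: it uses \p{L} instead of the explicit character ranges, quotes the word and the markers so that special regex characters in them are harmless, and the names MarcaPalavra and marcar are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: marks whole-word, case-insensitive occurrences of one word.
class MarcaPalavra {
    static String marcar(String texto, String palavra, String prefixo, String sufixo) {
        // (?<!\p{L}) : not preceded by a letter; (?!\p{L}) : not followed by a letter.
        Pattern p = Pattern.compile(
                "(?<!\\p{L})(" + Pattern.quote(palavra) + ")(?!\\p{L})",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // $1 re-inserts the text exactly as matched, so its capitalization is kept.
        String substituicao = Matcher.quoteReplacement(prefixo) + "$1"
                + Matcher.quoteReplacement(sufixo);
        return p.matcher(texto).replaceAll(substituicao);
    }

    public static void main(String[] args) {
        System.out.println(marcar("Dia a dia. diafragma.", "dia", "[", "]"));
        // prints [Dia] a [dia]. diafragma.
    }
}
```

The "dia" at the start of "diafragma" is left alone because it is followed by a letter, which is exactly what the negative lookahead rules out.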

13.02.2015 / 04:29

4

Hello. I made a very simple implementation.

Essentially words are counted and a map is populated with words and number of occurrences. I used the word as a key and the quantity as value.

Then I iterate over the map and, in the text, wrap the words that occur more than once in <b> tags.

In Java 8 the code would be cleaner and, of course, the implementation could be improved.

See the code:

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.StringTokenizer;


public class WordCount {
    Map<String, Integer> counter = new HashMap<String, Integer>();

    /**
     * @param args
     */
    public static void main(String[] args) {
        new WordCount().count();
    }

    private void count() {
        String string = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";

        StringTokenizer token = new StringTokenizer(string, " .,?:"); //characters of no interest (delimiters)

        while (token.hasMoreTokens()) {
            String s = token.nextToken();
            count(s);
            System.out.println(s);
        }

        System.out.println(counter);
        print(string);
    }

    private void count(String s) {
        Integer i = this.counter.get(s);
        this.counter.put(s, i == null ? 0 : ++i);
    }

    private void print(String s) {
        for (Entry<String, Integer> e : this.counter.entrySet()) {
            if (e.getValue() > 0) {
                s = s.replaceAll(e.getKey(), String.format("<b>%s</b>", e.getKey()));
            }
        }

        System.out.println(s);
    }
}
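As a possible illustration of the remark that Java 8 would make the code cleaner, the counting step could be sketched with Map.merge. Note one difference from the code above: this version stores the real number of occurrences (1 for a word seen once), so "repeated" means a value greater than 1. The names ContagemJava8 and contar are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical Java 8 variant of the counting step.
class ContagemJava8 {
    static Map<String, Integer> contar(String texto) {
        // LinkedHashMap keeps the words in the order they first appear.
        Map<String, Integer> contador = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(texto, " .,?:");
        while (tokens.hasMoreTokens()) {
            // merge inserts 1 for a new word, otherwise adds 1 to the current count.
            contador.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return contador;
    }

    public static void main(String[] args) {
        System.out.println(contar("Hoje em dia, dia a dia."));
        // prints {Hoje=1, em=1, dia=3, a=1}
    }
}
```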

12.02.2015 / 16:41

4

Okay, okay, okay, it's not Java ...

#!/usr/bin/perl
use strict;
use utf8::all;

my %conta;
my $s="Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";

for($s=~ m{(\w+)}g ) { $conta{$_}++ }
$s =~ s{(\w+)}{ if($conta{$1}>1){ "<b>$1</b>"} else {"$1"}}eg;

print $s;

12.02.2015 / 20:58


score 4 · Accepted Answer

Algorithm:

In this solution I split the whole text into tokens . Each token is either a word or whatever lies between words (spaces, punctuation and other symbols).

Then I iterate over the tokens, checking whether each one is a word that already exists among the last words read; the number of last words to compare against is limited by the specified radius.

If the token matches a repeated word within the radius, I mark both the token I am reading now and the earlier occurrence that was already there.

Finally, I go through all the tokens again, rebuilding the original text and marking the words whose tokens were flagged as repeated.

Code:

public static String assinalaPalavrasRepetidasEmUmRaio(String texto,
        String marcadorInicio, String marcadorFim, int qtdPalavrasRaio) {

    List<Token> tokens = new ArrayList<Token>();
    List<Token> palavrasNoRaio = new ArrayList<Token>();
    String palavraeNaoPalavraPattern = "\\p{L}+|[^\\p{L}]+";
    Matcher matcher = Pattern.compile(palavraeNaoPalavraPattern).matcher(texto);

    while (matcher.find()) {
        Token token = new Token(matcher.group());
        tokens.add(token);
        if (token.isPalavra() && palavrasNoRaio.contains(token)) {
            palavrasNoRaio.get(palavrasNoRaio.indexOf(token)).assinala();
            token.assinala();
        }
        if (token.isPalavra()) {
            palavrasNoRaio.add(token);
        }
        if (palavrasNoRaio.size() > qtdPalavrasRaio) {
            palavrasNoRaio.remove(0);
        }
    }
    StringBuilder textoReconstruido = new StringBuilder();
    for (Token token : tokens) {
        if (token.isAssinalado()) {
            textoReconstruido.append(marcadorInicio + token + marcadorFim);
        } else {
            textoReconstruido.append(token);
        }
    }
    return textoReconstruido.toString();
}

Token Class:

As noted in the code above, the Token itself knows whether or not it is a word, and also has a flag indicating whether it has been flagged.

class Token {
    private final String texto;
    private final boolean isPalavra;
    private boolean isAssinalado;

    public Token(String texto) {
        isPalavra = texto.matches("\\p{L}+");
        this.texto = texto;
    }
    public boolean isPalavra() {
        return isPalavra;
    }
    public void assinala() {
        isAssinalado = true;
    }
    public boolean isAssinalado() {
        return isAssinalado;
    }
    @Override
    public int hashCode() {
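        // Case-normalized so that it agrees with the equalsIgnoreCase used in equals.
        return texto.toLowerCase(java.util.Locale.ROOT).hashCode();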
    }
    @Override
    public boolean equals(Object obj) {
        if (obj == null || !(obj instanceof Token)) {
            return false;
        }
        return texto.equalsIgnoreCase(((Token)obj).texto);
    }
    @Override
    public String toString() {
        return texto;
    }
}

The hashCode and equals methods are not called directly by my code. equals is used by the Java implementation of list.contains and list.indexOf to decide whether an item matches the one being searched for. hashCode is not consulted by ArrayList; it is overridden together with equals to respect the usual contract, and it is what speeds up lookups in hash-based collections such as HashSet or HashMap.

There are several techniques for writing a hash code that helps performance. Here it is simply derived from the text, because the text is what equals compares to tell whether one token is equal to another; since that comparison ignores case, the hash should strictly be taken from a case-normalized form of the text, so that equal tokens always get equal hash codes. Note that if hashCode returned zero for every token the search would still work and only performance in hash-based collections would suffer; hash codes are worth an in-depth look.
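To make the roles of equals and hashCode concrete, here is a small self-contained sketch. The class Palavra is a hypothetical stand-in for Token (reduced to the text field), and DemoEqualsHashCode is likewise invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for Token: equality ignores case.
class Palavra {
    private final String texto;

    Palavra(String texto) { this.texto = texto; }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Palavra && texto.equalsIgnoreCase(((Palavra) obj).texto);
    }

    @Override
    public int hashCode() {
        // Hash of the lower-cased text, so "Dia" and "dia" share a hash code.
        return texto.toLowerCase(Locale.ROOT).hashCode();
    }
}

class DemoEqualsHashCode {
    // ArrayList.contains walks the list and calls equals on each element.
    static boolean naLista() {
        List<Palavra> lista = new ArrayList<>();
        lista.add(new Palavra("Dia"));
        return lista.contains(new Palavra("dia"));
    }

    // HashSet.contains uses hashCode to pick a bucket, then equals inside it.
    static boolean noConjunto() {
        Set<Palavra> conjunto = new HashSet<>();
        conjunto.add(new Palavra("Dia"));
        return conjunto.contains(new Palavra("dia"));
    }

    public static void main(String[] args) {
        System.out.println(naLista());    // true
        System.out.println(noConjunto()); // true
    }
}
```

If hashCode were computed from the text without normalizing its case, the list lookup would still succeed, while the set lookup could miss the element.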

Consumer Code:

And this is the unit test:

@Test
public void assinalaRepetidasEmUmRaio() {

  String texto = "Dia! É bom ser esperto, não é mesmo? O nosso dia a dia é complicado.";

  String esperado = "Dia! [É] bom ser esperto, não [é] mesmo? O nosso [dia] a [dia] é complicado.";

  String obtido = ProcessadorTexto.assinalaPalavrasRepetidasEmUmRaio(texto, "[", "]", 5);

  assertEquals(esperado, obtido);
}

Note that even words with different capitalization are matched ("É" and "é"), which was a deficiency of my first answer, brought to my attention by the answer from @mgibsonbr . What does the trick is the Token.equals method, which is used to check whether the token is already in the word list.

Also note that the first word "Dia" and the last "é" were not flagged, because their closest repetitions are farther away than the specified radius.

About regular expressions used:

The regular expression ( regex ) I used to find each Unicode word is \p{L}+ , because the plain \w+ in Java does not handle accented words well.

And the regex I used to get everything else that is not a word is the negation of that expression, i.e. [^\p{L}]+ . This is because the non-word regex \W+ in Java also matches the accented characters.

And to get all the tokens at once (words and non-words) I used the two regular expressions together, separated by the | symbol ( pipe ), which can be read as "one or the other"; for example, X|Y means "match either X or Y".
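A short sketch of this tokenization, separate from the full solution, may help: since every character is either a letter or not, the alternation covers the whole input, and concatenating the tokens gives back the original text. The names Tokenizador and tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a text into word and non-word tokens.
class Tokenizador {
    static List<String> tokens(String texto) {
        List<String> resultado = new ArrayList<>();
        // Either a run of letters (a word) or a run of non-letters (everything in between).
        Matcher m = Pattern.compile("\\p{L}+|[^\\p{L}]+").matcher(texto);
        while (m.find()) {
            resultado.add(m.group());
        }
        return resultado;
    }

    public static void main(String[] args) {
        String texto = "Dia a dia!";
        List<String> partes = tokens(texto);
        System.out.println(partes.size());                         // 6
        System.out.println(String.join("", partes).equals(texto)); // true
    }
}
```

The six tokens here are "Dia", a space, "a", a space, "dia" and "!".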