How to use more than one separator character in the split() method?

13

I'd like to break a String into several substrings, and I'm using the split() method for that. The problem is that I don't know in advance which characters will separate the words.

For example:

String words[] = line.split(" ");

This code does what I need as long as only " " separates the words. The problem is that the input will be read from a text file in which the user can put any character between words.

So I would need something like:

String words[] = line.split(" #@_\\/.*");

Is it possible to do this in Java? Any solution?

    
asked by anonymous 26.11.2014 / 18:09

3 answers

13

One possibility is:

    String a = "Exemplo, de. separar- string+ por* carater";
    // Since you want any non-word character to act as a separator, you can use this regular expression:
    String[] parts = a.split("[\\W]");

    for(String i:parts){
        System.out.println("===" +i);
    }

Output:

===Exemplo
===
===de
===
===separar
===
===string
===
===por
===
===carater

To get rid of the empty strings, change the split line so that it also consumes the space that follows each separator:

String[] parts = a.split("[\\W][ ]");

Output:

===Exemplo
===de
===separar
===string
===por
===carater
    
26.11.2014 / 18:25
12

Solution with \W

Java's regular expressions, as described in the documentation of the Pattern class, include the predefined character class \w (lowercase), which matches the characters that form a word. It is the same as [a-zA-Z_0-9].

There is also the character class \W (uppercase), which represents the opposite of the previous one, that is, characters that do not form words.

A simplistic solution would be to use \W to break the String by any non-word character, including any punctuation and space.

But there are problems with this approach:

  • It does not account for special characters that are commonly part of words, such as the hyphen.
  • It does not account for accented characters, because they are not part of the \w character set.
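
Both limitations show up in a small sketch. The class name and the sample words are made up for illustration, and the accented letters are written as Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.util.Arrays;

public class NonWordSplitDemo {
    // Splits on every single non-word character (anything outside [a-zA-Z_0-9]).
    static String[] splitOnNonWord(String text) {
        return text.split("\\W");
    }

    public static void main(String[] args) {
        // The hyphen counts as a separator, so the compound word is cut in two.
        System.out.println(Arrays.toString(splitOnNonWord("guarda-chuva")));
        // prints [guarda, chuva]

        // The two accented letters (c-cedilla, a-tilde) also count as separators.
        System.out.println(Arrays.toString(splitOnNonWord("cora\u00e7\u00e3o")));
        // prints [cora, , o]
    }
}
```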

Specific solution

A more specific solution would be to define a set of characters that should "break" the String. Example:

String caracteres = " #@_\\/.*";

Then you put these characters between square brackets, which in a regular expression define a custom character class. Example:

String words[] = line.split("[" + Pattern.quote(caracteres) + "]");

The Pattern.quote method above ensures that the characters are escaped as needed so that they do not break the regular expression.
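
To see why the escaping matters, here is a minimal sketch that builds the same character class with and without Pattern.quote. The class name, the method names, and the separator set "+-*" are assumptions chosen for the demonstration: left unescaped, "+-*" is read inside the brackets as a range from '+' to '*', which is not a valid range.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class QuoteDemo {
    // Builds a character class from the separators, escaped with Pattern.quote.
    static String[] splitQuoted(String line, String separators) {
        return line.split("[" + Pattern.quote(separators) + "]");
    }

    // Same idea without escaping; this breaks for some separator sets.
    static String[] splitUnquoted(String line, String separators) {
        return line.split("[" + separators + "]");
    }

    public static void main(String[] args) {
        String separators = "+-*";
        System.out.println(Arrays.toString(splitQuoted("1+2-3*4", separators)));
        try {
            splitUnquoted("1+2-3*4", separators);
        } catch (PatternSyntaxException e) {
            // Reached because "[+-*]" is rejected as an illegal character range.
            System.out.println("Invalid regex without quoting: " + e.getDescription());
        }
    }
}
```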

Complete example

// Requires: import java.util.regex.Pattern;
String line = "1 2#3@4_5/6.7*8";
String caracteres = " #@_\\/.*";
String words[] = line.split("[" + Pattern.quote(caracteres) + "]");
for (String string : words) {
    System.out.print(string + " ");
}

Output:

  

1 2 3 4 5 6 7 8

Special characters in sequence

With the expression above, empty strings may be left in the array if two separators or spaces appear in sequence. This is common in a sentence that contains a period or comma followed by a space.

To prevent this, add a + right after the character class so that split treats the whole run of separators as a single block. Example:

String words[] = line.split("[" + Pattern.quote(caracteres) + "]+");
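
The difference can be checked with a short sketch. The class name, the method names, and the separator set " ,." are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SeparatorRunDemo {
    static final String SEPARATORS = " ,.";

    // One separator per match: consecutive separators leave empty strings behind.
    static String[] splitSingle(String line) {
        return line.split("[" + Pattern.quote(SEPARATORS) + "]");
    }

    // One or more separators per match: a run such as ", " counts as one break.
    static String[] splitRuns(String line) {
        return line.split("[" + Pattern.quote(SEPARATORS) + "]+");
    }

    public static void main(String[] args) {
        String line = "one, two. three";
        System.out.println(Arrays.toString(splitSingle(line))); // prints [one, , two, , three]
        System.out.println(Arrays.toString(splitRuns(line)));   // prints [one, two, three]
    }
}
```

Note that a line that starts with a separator still produces an empty first element, even with the +.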
    
26.11.2014 / 18:45
5

I think this solves your problem:

class Test {
    public static void main(String[] args) {
        // Inside a Java string literal, \\ is a single backslash.
        String line = "banana*batata.pepino#alface_tomate@cenoura cebola/abacate|morango\\laranja";
        // Alternatives are separated by |; the backslash, dot and asterisk are escaped for the regex.
        for (String retval : line.split(" |#|@|_|\\\\|/|\\.|\\*")) {
            System.out.println(retval);
        }
    }
}

See it running on ideone and on Coding Ground. I also put it on GitHub for future reference.

I'm using the regex OR operator (|); after all, split() is based on regex.
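
As a sketch of how the alternation reads, the example below compares it with a character class that lists the same separators. The class and method names are hypothetical, and the sample line is a shortened version of the one above.

```java
import java.util.Arrays;

public class AlternationDemo {
    // Each alternative matches one separator; backslash, dot and asterisk are escaped.
    static String[] splitByAlternation(String line) {
        return line.split(" |#|@|_|\\\\|/|\\.|\\*");
    }

    // Character-class form listing the same separators; only the backslash needs escaping here.
    static String[] splitByClass(String line) {
        return line.split("[ #@_\\\\/.*]");
    }

    public static void main(String[] args) {
        String line = "banana*batata.pepino#alface_tomate@cenoura cebola/abacate\\laranja";
        System.out.println(Arrays.toString(splitByAlternation(line)));
        System.out.println(Arrays.toString(splitByClass(line)));
    }
}
```

Both calls should produce the same nine words. The alternation is easier to extend with multi-character separators, while the character class is shorter when every separator is a single character.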

    
26.11.2014 / 18:23