Solution with \W
In Java's regular expressions, as described in the documentation of the Pattern class, there is a predefined character class \w (lowercase) that matches the characters that form a word. It is equivalent to [a-zA-Z_0-9].
There is also the character class \W (uppercase), which is the opposite of the previous one: it matches any character that is not a word character.
A simplistic solution would be to use \W to split the String on any non-word character, including punctuation and spaces.
But there are problems with this approach:
- It does not account for special characters that are commonly part of words, such as the hyphen.
- It does not account for accented characters, because they are not part of the \w character set.
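Both limitations show up in a small sketch. The class name NonWordSplitDemo and the sample text are illustrative assumptions, not part of the original answer; the accented letter is written as a Unicode escape so the result does not depend on the source file encoding:

```java
import java.util.Arrays;

class NonWordSplitDemo {
    // Splits on every single non-word character, as the simplistic approach does.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W");
    }

    public static void main(String[] args) {
        // Hypothetical sample: one hyphenated word and one accented word ("não").
        String[] parts = splitOnNonWord("guarda-chuva n\u00e3o");
        // The hyphen and the accented letter are both treated as separators.
        System.out.println(Arrays.toString(parts)); // [guarda, chuva, n, o]
    }
}
```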
Specific solution
A more specific solution would be to define a set of characters that should "break" the String. Example:
String caracteres = " #@_\\/.*";
Then place these characters inside square brackets, which in a regular expression define a custom character class. Example:
String[] words = line.split("[" + Pattern.quote(caracteres) + "]");
The Pattern.quote method above ensures that the characters are escaped as needed, so that they do not break the regular expression.
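To see why the escaping matters, consider a hypothetical delimiter set "+-/". Placed unescaped between the brackets, +-/ is read as a range from + to /, which also covers the comma and the period. The sketch below compares both variants; the class and method names are assumptions made for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteDelimitersDemo {
    // No escaping: inside [...] the text "+-/" is interpreted as a character range.
    static String[] splitUnquoted(String text, String delimiters) {
        return text.split("[" + delimiters + "]");
    }

    // With Pattern.quote, each delimiter character is taken literally.
    static String[] splitQuoted(String text, String delimiters) {
        return text.split("[" + Pattern.quote(delimiters) + "]");
    }

    public static void main(String[] args) {
        String delimiters = "+-/"; // hypothetical delimiter set
        String text = "3.14+2";    // hypothetical input
        System.out.println(Arrays.toString(splitUnquoted(text, delimiters))); // [3, 14, 2]
        System.out.println(Arrays.toString(splitQuoted(text, delimiters)));   // [3.14, 2]
    }
}
```

In the unquoted version the decimal point is wrongly treated as a separator, because it falls inside the accidental range.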
Complete example
// Requires: import java.util.regex.Pattern;
String line = "1 2#3@4_5/6.7*8";
// Delimiters: space, #, @, _, backslash, /, . and *
String caracteres = " #@_\\/.*";
String[] words = line.split("[" + Pattern.quote(caracteres) + "]");
for (String word : words) {
    System.out.print(word + " ");
}
Output:
1 2 3 4 5 6 7 8
Special characters in sequence
With the expression above, empty strings may be left in the array if two special characters or spaces appear in sequence. This is common in a sentence that contains a period or comma followed by a space.
To prevent this, just add a + to the right of the custom class, so that split treats a whole run of special characters as a single separator. Example:
String[] words = line.split("[" + Pattern.quote(caracteres) + "]+");
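The difference can be checked with a short sketch. The class name, the delimiter set " ,." and the sample sentence are assumptions chosen for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ConsecutiveDelimitersDemo {
    static final String DELIMITERS = " ,."; // hypothetical delimiter set

    // Every single delimiter character is a separator.
    static String[] splitEach(String text) {
        return text.split("[" + Pattern.quote(DELIMITERS) + "]");
    }

    // A run of one or more delimiter characters is a single separator.
    static String[] splitRuns(String text) {
        return text.split("[" + Pattern.quote(DELIMITERS) + "]+");
    }

    public static void main(String[] args) {
        String text = "one, two. three"; // hypothetical input
        System.out.println(Arrays.toString(splitEach(text))); // [one, , two, , three]
        System.out.println(Arrays.toString(splitRuns(text))); // [one, two, three]
    }
}
```

Note that even with the +, a String that begins with a delimiter still yields an empty first element, so that case may need separate handling.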