Manipulating String in Java

Question


score 9

I have text in a String and need to go through it, extracting every word it contains. I thought about using string.split(" "), but I also need to handle ".", ";", ",", ":", "!" and "?", among other cases. How can I do this?

java string

asked by anonymous 05.08.2014 / 15:17

2 answers

score 3

You can use a regular expression for this.

Split

The split method of Java's String class accepts a regular expression (see the String.split documentation).

Something like this: [.;,:!?] (a character class listing the characters you want to split on).

This splits the string at each of those characters and returns an array.

In Java it would look something like this:

String str = "Eu sei? que nada, sei, mais uns .'s e umas ,'s";
// Split wherever one of the listed punctuation characters occurs
String[] result = str.split("[.;,:!?]");
for (String r : result) {
    System.out.println(r);
}

The output would look like this:

Eu sei
 que nada
 sei
 mais uns 
's e umas 
's
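
Note that the tokens above keep their leading and trailing spaces, because only the punctuation acts as a delimiter. If the goal is to get each word directly, one possible variation is to add whitespace to the character class and split on runs of delimiters. This is only a sketch: the class name SplitWords and the method words are made up for illustration, and the delimiter list is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitWords {
    // Splits on runs of whitespace and the listed punctuation marks,
    // then drops the empty token that a leading delimiter produces.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("[\\s.;,:!?]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Eu, sei, que, nada, sei]
        System.out.println(words("Eu sei? que nada, sei!"));
    }
}
```

The trailing + in the pattern matters here: without it, "sei? que" would produce an empty token between the "?" and the space.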

Replace

You can also remove the unwanted characters with a replacement. The replaceAll method of Java's String class also accepts a regular expression (see the String.replaceAll documentation).

It would look something like this:

String result2 = str.replaceAll("[.;,:!?]", "");
System.out.println(result2);

The output would look like this:

Eu sei que nada sei mais uns 's e umas 's
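
To go from the cleaned string to individual words, the replacement can be followed by a split on whitespace. The following is a minimal sketch under that assumption; StripThenSplit and words are hypothetical names, not part of any library.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StripThenSplit {
    // Removes the listed punctuation, then splits what is left on whitespace.
    static List<String> words(String text) {
        String cleaned = text.replaceAll("[.;,:!?]", "").trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [Eu, sei, que, nada, sei]
        System.out.println(words("Eu sei? que nada, sei!"));
    }
}
```

One caveat of this approach: punctuation with no space around it is simply deleted, so "a,b" becomes the single word "ab". If that matters, splitting on the punctuation is probably the safer choice.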

As I understand it, this is roughly what you are looking for. See if it works for you.

05.08.2014 / 15:27


score 8 · Accepted Answer

You can use Regex. Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TesteRegex {
    public static void main(String[] args) {
        String frase = "Várias palavras em uma só String.\n"
                + "Ignorando pontos; Ponto-e-vírgula; Traços. E números 0132.";
        // Match runs of unaccented and accented letters
        Pattern p = Pattern.compile("[a-zA-Zà-úÀ-Ú]+");
        Matcher m = p.matcher(frase);
        int i = 1;
        // Each call to find() advances to the next word
        while (m.find()) {
            System.out.println("Palavra " + i + ": " + m.group());
            i++;
        }
        System.out.println("Frase completa: " + frase);
    }
}

Result:

Palavra 1: Várias
Palavra 2: palavras
Palavra 3: em
Palavra 4: uma
Palavra 5: só
Palavra 6: String
Palavra 7: Ignorando
Palavra 8: pontos
Palavra 9: Ponto
Palavra 10: e
Palavra 11: vírgula
Palavra 12: Traços
Palavra 13: E
Palavra 14: números
Frase completa: Várias palavras em uma só String.
Ignorando pontos; Ponto-e-vírgula; Traços. E números 0132.

The pattern I used, [a-zA-Zà-úÀ-Ú]+, matches everything from a to z and everything from à to ú, in both lower and upper case. The + quantifier means one or more consecutive characters, so each match is a whole run of letters rather than a single character.

Consequently, everything else is ignored, including spaces, special characters and digits, as the example above shows.
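
To make the effect of the + quantifier concrete, here is a small sketch that counts matches with and without it. QuantifierDemo and countMatches are names invented for this illustration, and the sample text is arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Counts how many times the regex matches inside the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "ab cd 12";
        System.out.println(countMatches("[a-zA-Z]", text));  // prints 4 (a, b, c, d)
        System.out.println(countMatches("[a-zA-Z]+", text)); // prints 2 (ab, cd)
    }
}
```

In both cases the space and the digits are skipped, since they are not in the character class.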

Looking at the Unicode character list, we can see that the range from à to ú includes some characters that may be undesirable, such as æ, å, ÷ and ø. Here are the complete ranges:

From À to Ú: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú

From à to ú: à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú
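
The contents of such a range can be checked directly in code. The sketch below walks the code points from à (U+00E0) to ú (U+00FA) and confirms that the division sign ÷ (U+00F7) falls inside the range. RangeCheck and charsBetween are hypothetical names, and Unicode escapes are used only so the source does not depend on the file encoding.

```java
public class RangeCheck {
    // Returns every character from 'from' to 'to' (inclusive), in code point order.
    static String charsBetween(char from, char to) {
        StringBuilder sb = new StringBuilder();
        for (char c = from; c <= to; c++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e0 is a-grave, \u00fa is u-acute: the same range as in the pattern
        System.out.println(charsBetween('\u00e0', '\u00fa'));
        // \u00f7 is the division sign; the range accepts it
        System.out.println("\u00f7".matches("[\u00e0-\u00fa]")); // prints true
    }
}
```

How the first line is displayed depends on the console encoding, so its output is not shown here.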

If you are reading files from a variety of sources, you may come across these characters from time to time. If you are reading from a TextField that the user is typing into, I would say it is unnecessary to exclude them, because hardly anyone will type an Å in the middle of a text; our keyboards do not even have a key for it (I had to copy and paste it).

But if you prefer, you can use a more specific pattern that accepts only the characters used in our alphabet, which would be: [a-zA-ZàáâãçèéêìíîòóôõùúÀÁÂÃÇÈÉÊÌÍÎÒÓÔÕÙÚ]+

Notice that the - sign indicates a range of values, so à-ú accepts everything from à to ú. In the pattern above I did not use a range for the accented characters; I specified one by one which characters are accepted. For the unaccented letters I kept a-zA-Z, as there are no unwanted characters in between them.
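
As an alternative that this answer does not use, Java's regex engine also supports the Unicode category class \p{L}, which matches any letter. It leaves out symbols such as ÷ and × as well as digits, but it still accepts letters from other alphabets (æ, å, ø and so on), so it suits the case where any letter counts as part of a word. A minimal sketch, with UnicodeWords and words as made-up names and Unicode escapes in the sample text to avoid depending on the file encoding:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWords {
    // \p{L} matches any character in the Unicode "Letter" category.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Collects every run of letters found in the text.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00e7 is c-cedilla, \u00fa is u-acute, \u00f7 is the division sign
        System.out.println(words("Tra\u00e7os. E n\u00fameros 0132 \u00f7 fim"));
    }
}
```

For the sample text this yields four words (Traços, E, números, fim); the digits and the ÷ are skipped.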