Collecting links in a text file


My program needs to collect every link in a text file. I used a regex to collect the links, but only a few of the lines are matched (the file has approximately 3,500 lines), so only some of the links are captured. Note: the text file comes from a WhatsApp chat. This is my code so far:

private void jButton2ActionPerformed(java.awt.event.ActionEvent evt) {
    File arq;
    //FileWriter saveArq;
    Scanner lerArq;
    try {
        arq = new File("/home/kayck/Documentos/list.txt");
        //saveArq = new FileWriter("/home/kayck/Área de Trabalho/Salvo.txt");
        String listaFinal = "";
        String pattern = "https.*. ";
        Pattern r = Pattern.compile(pattern);
        lerArq = new Scanner(arq, "UTF-8");
        Matcher matcher = r.matcher("");
        while (lerArq.hasNextLine()) {
            String line = lerArq.nextLine();
            matcher.reset(line);
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
        }

        lerArq.close();
        //saveArq.write(listaFinal);
        //saveArq.close();
        JOptionPane.showMessageDialog(null, "Saved successfully!", "Done", JOptionPane.INFORMATION_MESSAGE);
    } catch (FileNotFoundException ex) {
        Logger.getLogger(TelaPrincipal.class.getName()).log(Level.SEVERE, null, ex);
        JOptionPane.showMessageDialog(null, "File could not be loaded.", "Error", JOptionPane.INFORMATION_MESSAGE);
    } catch (IOException ex) {
        Logger.getLogger(TelaPrincipal.class.getName()).log(Level.SEVERE, null, ex);
    }
}

And here is the link to the txt file: link

And the output:

https://chat.whatsapp.com/HZXXYmjcVAY6AZeR34djup 
https://chat.whatsapp.com/LNM1FIhkrAA5kJ5tuwYvO8 [6/11 10:41 PM] +55 74 8844-2659‬: 
https://chat.whatsapp.com/57WiGUbr8JQHvw48bUpD4v só clicar no link pra 
https://chat.whatsapp.com/Cqo7yy6G0WZ0yPKBcxeOQy Vamos de 
https://chat.whatsapp.com/FG0EfM0l7LfCTZhkgcX23F 
https://chat.whatsapp.com/308yhRFusJo2dj94PqONMw [29/10 11:24 PM] +55 21 96680-8612‬: Acesse este link para entrar no meu grupo do WhatsApp: 
https://chat.whatsapp.com/GUBItIHCK84BrnR0LiYoOA 
https://chat.whatsapp.com/8N98Ei6IkzA2jbcgi3diRz quem quiser entra só clica no link 
https://chat.whatsapp.com/30nqyyplmQO2ooTfqMBQiv quem gosta de uma música sertaneja se quiser participar do meu Grupo add ai 
https://chat.whatsapp.com/BiHEx7MZxpG6VN5YQr8jL1  (pra entrar no grupo só clickar no 
https://chat.whatsapp.com/EpxEUAA7VLx1Yr451hiZvH / Se quiser entrar 
https://chat.whatsapp.com/8N98Ei6IkzA2jbcgi3diRz quem quiser entra só clica no link 
https://chat.whatsapp.com/30nqyyplmQO2ooTfqMBQiv quem gosta de uma música sertaneja se quiser participar do meu Grupo add ai 
https://chat.whatsapp.com/BiHEx7MZxpG6VN5YQr8jL1  (pra entrar no grupo só clickar no 
https://chat.whatsapp.com/6ev8xZIsXdIBVuhSrvjs3q *DIVULGA AE PRA 
https://chat.whatsapp.com/JOiJQ2F1hXb39H9kHC3eC9  
https://chat.whatsapp.com/K4tk0xyRAPuHzlLGiMAKNR regras do grupo / nao ofender os participantes do grupo / nao mudar a imagem e nem o nome do grupo /nao enviar correntes /nao mandar pornografias /se nao cumprirem as regras do grupo sera removido/fora isso sejam todos bem vindos e aproveitem o grupo grato pela 
https://chat.whatsapp.com/6vQZWAFDMDR7ao3n7iCWCM regras do grupo pode mandar musicas do legião videos /fazer comentários/pode tambem mandar musicas de outras bandas brasileiras de rock tambem/mais esse grupo e especialmente pra musicas do legião urbana/videos /nao mandar correntes /nao mudar a imagem do grupo e nem o nome / nao mandar pornografias/nao ofender os participantes do grupo se nao sera removido severamente grato pela compreensão                     
asked by anonymous 14.11.2016 / 16:48

1 answer


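The logic is simple: every URL starts with http and ends at the first space.

Your pattern https.*. is greedy: .* runs to the end of the line and then backtracks only as far as the last space, which is why the text after each link ends up in the match.

You can use this pattern instead: .*(http[^ ]+).*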
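As a minimal sketch of the "starts with http, ends at the first space" rule, the core of the pattern, http[^ ]+, can be used on its own with find() to pull every URL out of a line. The class name LinkFinder, the method extractLinks, and the a.example / b.example URLs are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFinder {
    // "http" followed by every character up to (but not including) the next space.
    private static final Pattern URL = Pattern.compile("http[^ ]+");

    /** Returns every URL found in one line, in order of appearance. */
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(line);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample line in the style of a WhatsApp chat log.
        String line = "see http://a.example/x and https://b.example/y now";
        for (String link : extractLinks(line)) {
            System.out.println(link);
        }
    }
}
```

In the question's loop, this would amount to compiling "http[^ ]+" instead of "https.*. " and keeping the existing while (matcher.find()) block unchanged.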

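The file has many lines, but no URL is split across two lines, so each line can be treated independently. Reading line by line, as your code does, already guarantees that. If you run the pattern over the whole text at once instead, . does not match line terminators by default in Java, so each match still stays within one line; the m modifier (Pattern.MULTILINE in Java) is only needed if you anchor the pattern with ^ and $.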

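With this pattern you can work in either of two ways: by replace, substituting each match with $1 so that the content of the lines is rewritten and only the URL remains, or by match, in which case the URL is in group 1 of the match.

Note that the leading .* is greedy, so when a line contains several URLs, group 1 holds only the last one. To collect all of them, search with find() using just http[^ ]+.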
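The two approaches can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the class name LinkLines, the methods keepOnlyLinks and linkOf, and the a.example / b.example URLs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkLines {
    // Whole-line pattern from the answer; group 1 holds the URL.
    private static final Pattern LINE = Pattern.compile(".*(http[^ ]+).*");

    /**
     * Replace approach: every line that contains a URL is reduced to that URL.
     * Lines without a URL do not match and are left unchanged.
     */
    static String keepOnlyLinks(String text) {
        return LINE.matcher(text).replaceAll("$1");
    }

    /** Match approach: returns the URL from group 1, or null if the line has none. */
    static String linkOf(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical three-line sample; the middle line has no link.
        String text = "join https://a.example/x now\nno link here\nhttps://b.example/y";
        System.out.println(keepOnlyLinks(text));
        System.out.println(linkOf("join https://a.example/x now"));
    }
}
```

Because lines without a URL survive the replace untouched, the match approach is probably the better fit when the goal is a clean list of links.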

    
17.11.2016 / 14:56