Data scraping with jsoup and saving to a txt file


Hello everyone, how are you? I'm trying to learn data scraping on my own, and since my English doesn't help, I'm muddling through as best I can. The situation is basically the following. When I run my code, it lists the athletes of the International Judo Federation, one below the other. I found that each iteration picks up all the athletes of one country at once, so the String becomes a single block with every athlete from that country. I'd like to split it to get one athlete at a time, but I couldn't. Also, when printing to the console, it prints one athlete below the other; when writing to the txt file, however, it puts everything on the same line and only breaks when the country changes. Another thing I noticed is that it glues the end of one athlete's name to the start of the next. Example of what the code saves to the txt:

  

0_ Afghanistan ABDUL HADI Gada KhilAFGHAN ZergulAHMADI Ahmad ShabirALIPOOR Abdul HadiARMAN KhaledARYAN Mod ReshadASSADI YahyaASSADI RohullahBAKHSHI Mohammad tawfiqBAREKZAI Ahmad HamedBAYAT HabibaFAIZ ZADA AjmalFAIZZADA Ajmal FAZLI Abdul FahimHUSSAINI AtefaHUSSAINI Sayed Hussain

I would like it to look like this:

  

0_ Afghanistan
ABDUL HADI Gada Khil
AFGHAN Zergul
AHMADI Ahmad Shabir
ALIPOOR Abdul Hadi
ARMAN Khaled
ARYAN Mod Reshad
ASSADI Yahya
ASSADI Rohullah
BAKHSHI Mohammad tawfiq
BAREKZAI Ahmad Hamed
BAYAT Habiba

Here is my code:

package pack;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import org.jsoup.Jsoup;
public class Main {

    @SuppressWarnings("null")
    public static void main(String[] args) throws IOException{
        // In this block I fetch the countries and their codes and store them in bd, to use later in the athlete-scraping URL.
        org.jsoup.nodes.Document doc = Jsoup.connect("https://www.ijf.org/judoka?name=&nation=all&gender=both&category=all").get();
        String paises = doc.select("option").text().replace(")", ")\n").replaceAll("All","").toString();
        int pos = paises.indexOf(")")+1;
        int quebra = paises.indexOf("\n")+1;
        int i=0;    

        ArrayList<Nacoes>bd = new ArrayList<>();
        String pais = "a", sigla ;
        while (pais.length()>0) {
            Nacoes n = new Nacoes();
            pais = paises.substring(0,pos);
            sigla = pais.substring(pais.length()-4, pais.length()-1);
            pais = pais.substring(0,pais.length()-5);            
            paises = paises.substring(quebra,paises.length());
            quebra = paises.indexOf("\n")+1;
            pos = paises.indexOf(")")+1;
            n.setPais(pais);
            n.setSigla(sigla);
            bd.add(n);      
            i++;
            pais = paises.substring(0,pos);
        }


        File arquivo = new File("C:\\ifjAtletas.txt");
        FileWriter grava = new FileWriter(arquivo);
        PrintWriter escreve = new PrintWriter(grava);

        org.jsoup.nodes.Document doc2 = null;
        String inHtml = ("https://www.ijf.org/judoka?name=&nation=");
        String fimHtml = ("&gender=both&category=all");
        i=0;

        while(i<5) {
            doc2 =  Jsoup.connect(inHtml+bd.get(i).sigla+fimHtml).get();
            String atletas = doc2.select("a").text().toString();
            atletas = atletas.substring(1247, (atletas.length())-54).replace(" "+bd.get(i).sigla+" ","\n");
            escreve.println(i+"_ "+bd.get(i).pais +"\n "+atletas+"\n");

            System.out.println((i+"_ "+bd.get(i).pais +"\n "+atletas+"\n"));
            i++;
        }


        escreve.close();
        grava.close();
    }
}

The Nacoes class has only two Strings (pais and sigla) and their getters/setters.
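For readers who want to compile the code above, here is a hypothetical reconstruction of such a class. The real Nacoes was not posted, so everything below is an assumption based on how Main uses it: the two fields are package-private because Main reads bd.get(i).pais and bd.get(i).sigla directly, and the class would live in the same package as Main.

```java
// Hypothetical reconstruction: the original Nacoes class was not posted.
// In the question's project it would sit in the same package as Main.
class Nacoes {
    String pais;   // country name; package-private because Main reads it directly
    String sigla;  // three-letter country code used in the athlete URL

    public String getPais() {
        return pais;
    }

    public void setPais(String pais) {
        this.pais = pais;
    }

    public String getSigla() {
        return sigla;
    }

    public void setSigla(String sigla) {
        this.sigla = sigla;
    }
}
```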

Below is a piece of the site's HTML, in case someone can suggest an easier way.

<div class="results container-narrow">
    <a href="/judoka/33416" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/33416.jpg"alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">ADRIANO</div>
            <div class="given_name">Gabriel</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png"alt="">
                BRA
            </div>
        </div>
    </a>
    <a href="/judoka/1039" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/1039.jpg"alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">AGUIAR</div>
            <div class="given_name">Mayra</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png"alt="">
                BRA
            </div>
        </div>
    </a>

Thank you guys, hugs.

    
asked by anonymous 29.10.2018 / 19:09

1 answer


From what I understand after researching, the problem lies in this line:

escreve.println(i+"_ "+bd.get(i).pais +"\n "+atletas+"\n");

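On Windows the line separator is "\r\n", and some editors (older versions of Notepad, for example) do not display a bare "\n" as a line break. When writing with PrintWriter, the fix is to use "\r\n" instead of just "\n" and, with the breaks written explicitly, print instead of println. The line would then be: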

escreve.print(i+"_ "+bd.get(i).pais +"\r\n"+atletas+"\r\n");
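A small standalone sketch of the same idea, for trying it outside the scraper. The names LineBreakSketch, withSeparator and writeBlock are hypothetical, the sample athletes come from the question's expected output, and it writes to a temporary file rather than the path used in the question:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineBreakSketch {

    // Normalizes existing breaks to "\n", then replaces each with the given separator.
    static String withSeparator(String text, String separator) {
        return text.replace("\r\n", "\n").replace("\n", separator);
    }

    // Writes a header line and a body, using explicit "\r\n" breaks throughout.
    static void writeBlock(Path file, String header, String body) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            out.print(header + "\r\n" + withSeparator(body, "\r\n") + "\r\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp file stands in for the question's output file.
        Path file = Files.createTempFile("ijfAtletas", ".txt");
        writeBlock(file, "0_ Afghanistan", "ABDUL HADI Gada Khil\nAFGHAN Zergul");
        System.out.println("Wrote " + file);
    }
}
```

The try-with-resources block closes the writer even if an exception is thrown, so no separate close() calls are needed.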

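Another way to get the line break is System.getProperty("line.separator"), which returns the separator of the platform the program is running on. Note that atletas also contains the "\n" inserted by the replace call on the previous line of the loop, so that call needs the same change.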
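As for an easier way to separate the athletes: in the markup quoted in the question, each athlete is its own a.judoka element, with the name split across .family_name and .given_name, so there is no need to cut the text by fixed positions. A minimal sketch, assuming the live page keeps that structure. The class and method names (JudokaSketch, athleteNames, athleteNamesFromHtml) are made up for illustration, and the HTML is a trimmed inline copy of the quoted fragment rather than a live request:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class JudokaSketch {

    // Returns one "FAMILY Given" string per a.judoka element in the document.
    static List<String> athleteNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element judoka : doc.select("a.judoka")) {
            String family = judoka.select(".family_name").text();
            String given = judoka.select(".given_name").text();
            names.add(family + " " + given);
        }
        return names;
    }

    // Convenience wrapper that parses an HTML string first.
    static List<String> athleteNamesFromHtml(String html) {
        return athleteNames(Jsoup.parse(html));
    }

    public static void main(String[] args) {
        // Trimmed copy of the fragment from the question (images omitted).
        String html = "<div class=\"results container-narrow\">"
                + "<a href=\"/judoka/33416\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">ADRIANO</div>"
                + "<div class=\"given_name\">Gabriel</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "<a href=\"/judoka/1039\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">AGUIAR</div>"
                + "<div class=\"given_name\">Mayra</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "</div>";
        for (String name : athleteNamesFromHtml(html)) {
            System.out.println(name); // one athlete per line
        }
    }
}
```

With the names in a list, each one can be written to the file with its own line break. To read the real page, Jsoup.connect(url).get() would take the place of Jsoup.parse(html), as in the question's code.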

    
29.10.2018 / 20:40