Hello, how are you all? I'm trying to learn web scraping on my own, and since my English isn't much help, I'm getting by as best I can. The problem is basically the following. When I run my code, it lists the athletes of the International Judo Federation, one below the other. I've found that each iteration picks up a whole country at once, so the String becomes a single block with all the athletes of that country. I'd like to split it so that I get one athlete at a time, but I haven't managed to.
When printing to the console, it prints one underneath the other. When writing to the txt file, however, it puts everything on the same line and only starts a new line when the country changes. Another thing I noticed is that it glues the end of one athlete's name to the start of the next. Example: this is what the code saves to the txt file:
0_ Afghanistan ABDUL HADI Gada KhilAFGHAN ZergulAHMADI Ahmad ShabirALIPOOR Abdul HadiARMAN KhaledARYAN Mod ReshadASSADI YahyaASSADI RohullahBAKHSHI Mohammad tawfiqBAREKZAI Ahmad HamedBAYAT HabibaFAIZ ZADA AjmalFAIZZADA Ajmal FAZLI Abdul FahimHUSSAINI AtefaHUSSAINI Sayed Hussain
I would like it to look like this:
0_ Afghanistan
AHMADI Ahmad Shabir
ALIPOOR Abdul Hadi
ARMAN Khaled
ARYAN Mod Reshad
ASSADI Rohullah
BAKHSHI Mohammad tawfiq
BAREKZAI Ahmad Hamed
BAYAT Habiba
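One possible way to get that layout from the glued block is sketched below. It is only a heuristic under stated assumptions: it cuts wherever a lowercase letter is directly followed by two uppercase letters (as between "Khil" and "AFGHAN"), so it will miss names that are separated by a space instead of being glued, such as "FAIZZADA Ajmal FAZLI Abdul Fahim" in the sample. The class name `AthleteLines`, its method names, and the temporary output file are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AthleteLines {

    // Heuristic: cut wherever a lowercase letter is directly followed by
    // two uppercase letters, e.g. between "Khil" and "AFGHAN".
    static List<String> splitGluedNames(String block) {
        return Arrays.asList(block.trim().split("(?<=\\p{Ll})(?=\\p{Lu}{2})"));
    }

    // Builds the lines for one country: a header line, then one athlete per line.
    static List<String> countryLines(int index, String country, String block) {
        List<String> lines = new ArrayList<>();
        lines.add(index + "_ " + country);
        lines.addAll(splitGluedNames(block));
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String block = "ABDUL HADI Gada KhilAFGHAN ZergulAHMADI Ahmad ShabirALIPOOR Abdul Hadi";
        List<String> lines = countryLines(0, "Afghanistan", block);

        // Files.write ends every line with the platform line separator, so
        // editors that expect \r\n on Windows also show one name per line.
        Path out = Files.createTempFile("ifjAtletas", ".txt"); // hypothetical output location
        Files.write(out, lines, StandardCharsets.UTF_8);

        lines.forEach(System.out::println);
    }
}
```

Writing a list of lines instead of one String containing "\n" avoids depending on how a particular editor treats a bare line feed. Selecting each athlete's element from the HTML, rather than splitting the combined text afterwards, would likely be more robust than this heuristic.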
Here is my code.
package pack;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import org.jsoup.Jsoup;

public class Main {
    public static void main(String[] args) throws IOException {
        // In this block I get the countries and their codes and insert them into bd,
        // to use them in the athlete-scraping URL.
        org.jsoup.nodes.Document doc = Jsoup.connect("https://www.ijf.org/judoka?name=&nation=all&gender=both&category=all").get();
        String paises = doc.select("option").text().replace(")", ")\n").replaceAll("All", "");
        int pos = paises.indexOf(")") + 1;
        int quebra = paises.indexOf("\n") + 1;
        int i = 0;
        ArrayList<Nacoes> bd = new ArrayList<>();
        String pais = "a", sigla;
        while (pais.length() > 0) {
            Nacoes n = new Nacoes();
            pais = paises.substring(0, pos);
            sigla = pais.substring(pais.length() - 4, pais.length() - 1);
            pais = pais.substring(0, pais.length() - 5);
            // Store the pair in bd (not shown in the original snippet; assumed from the comment above).
            n.pais = pais;
            n.sigla = sigla;
            bd.add(n);
            paises = paises.substring(quebra, paises.length());
            quebra = paises.indexOf("\n") + 1;
            pos = paises.indexOf(")") + 1;
            pais = paises.substring(0, pos);
        }

        File arquivo = new File("C:\\ifjAtletas.txt"); // the backslash must be escaped in a Java string
        FileWriter grava = new FileWriter(arquivo);
        PrintWriter escreve = new PrintWriter(grava);
        org.jsoup.nodes.Document doc2 = null;
        String inHtml = "https://www.ijf.org/judoka?name=&nation=";
        String fimHtml = "&gender=both&category=all";
        while (i < 5) {
            doc2 = Jsoup.connect(inHtml + bd.get(i).sigla + fimHtml).get();
            // text() on all <a> elements returns one String for the whole page.
            String atletas = doc2.select("a").text();
            atletas = atletas.substring(1247, atletas.length() - 54).replace(" " + bd.get(i).sigla + " ", "\n");
            escreve.println(i + "_ " + bd.get(i).pais + "\n " + atletas + "\n");
            System.out.println(i + "_ " + bd.get(i).pais + "\n " + atletas + "\n");
            i++; // move on to the next country
        }
        escreve.close(); // flush and close the file
    }
}
The Nacoes class has only two strings and their getters / setters.
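For reference, a class matching that description might look roughly like the sketch below. The field names `pais` and `sigla` come from the code above, which reads them directly; the getter and setter names are assumptions, and the `package pack;` line is left out so the snippet compiles on its own.

```java
// Sketch of a Nacoes class; the accessor names are assumptions.
class Nacoes {
    // Package-private so that Main, in the same package, can read them directly.
    String pais;   // country name, e.g. "Afghanistan"
    String sigla;  // three-letter country code, e.g. "AFG"

    String getPais() { return pais; }
    void setPais(String pais) { this.pais = pais; }

    String getSigla() { return sigla; }
    void setSigla(String sigla) { this.sigla = sigla; }
}
```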
Below is a piece of the site's HTML, in case someone can suggest an easier way.
<div class="results container-narrow">
<a href="/judoka/33416" class="judoka">
<div class="judoka__profile_image">
<img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/33416.jpg"alt="">
<div class="judoka__info">
<div class="family_name">ADRIANO</div>
<div class="given_name">Gabriel</div>
<div class="country">
<img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png"alt="">
<a href="/judoka/1039" class="judoka">
<div class="judoka__profile_image">
<img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/1039.jpg"alt="">
<div class="judoka__info">
<div class="family_name">AGUIAR</div>
<div class="given_name">Mayra</div>
<div class="country">
<img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png"alt="">
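Given that markup, one approach that may be easier is to select each `a.judoka` element and read its `family_name` and `given_name` children separately, instead of taking the text of every link at once. The sketch below assumes the class names shown in the excerpt are stable and that each athlete's tags are properly closed on the real page; `JudokaSelectors` and `athleteNames` are hypothetical names, and the HTML string is a trimmed copy of the excerpt with closing tags added.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JudokaSelectors {

    // Returns "FAMILY Given" for every <a class="judoka"> element in the document.
    static List<String> athleteNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element judoka : doc.select("a.judoka")) {
            String family = judoka.select("div.family_name").text();
            String given = judoka.select("div.given_name").text();
            names.add(family + " " + given);
        }
        return names;
    }

    public static void main(String[] args) {
        // Trimmed copy of the excerpt, with closing tags added (an assumption).
        String html =
              "<div class=\"results container-narrow\">"
            + "<a href=\"/judoka/33416\" class=\"judoka\">"
            + "<div class=\"judoka__info\">"
            + "<div class=\"family_name\">ADRIANO</div>"
            + "<div class=\"given_name\">Gabriel</div>"
            + "</div></a>"
            + "<a href=\"/judoka/1039\" class=\"judoka\">"
            + "<div class=\"judoka__info\">"
            + "<div class=\"family_name\">AGUIAR</div>"
            + "<div class=\"given_name\">Mayra</div>"
            + "</div></a>"
            + "</div>";

        Document doc = Jsoup.parse(html);
        // One athlete per element, so nothing needs to be split afterwards.
        athleteNames(doc).forEach(System.out::println);
    }
}
```

On the live site, the `Document` would come from `Jsoup.connect(url).get()` as in the code above, and each name could be written to the file with its own `println` call.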
Thank you guys, hugs.