How do I extract all image URLs from an HTML file (in Java)?

I'm processing HTML files that have multiple links to external images. How do I extract just those image links and download them?

HTML example:

<html>
<head>
<meta charset="UTF-8">
<title>A Page</title>
</head>
<body class="wiki">
    <div id="page-base" class="noprint"></div>

<div id="indicator-default" class="indicator">
    <img alt="Page gold" src="http://m.org/commons/thumb/f/Gold.png" width="150" height="200">
    <img alt="Page silver" src="http://m.org/commons/thumb/f/Silver.png" width="120" height="200">
</div>
</body>
</html>

In this example I want to save the Gold.png and Silver.png image files referenced in the HTML to my computer. Which Java library or API can I use for this?

    
asked by anonymous 18.03.2017 / 20:00

1 answer

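Use jsoup to parse the HTML, read the src attribute of every <img> element, and download each URL:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Inside a method that declares "throws IOException"; html holds the page source.
    Document document = Jsoup.parse(html);
    Elements imgTags = document.getElementsByTag("img");
    for (Element e : imgTags) {
        String url = e.attr("src"); // "" when the attribute is missing
        if (url.isEmpty()) {
            throw new IOException("Empty src attribute: " + e.outerHtml());
        }
        // File name is everything after the last slash, e.g. "Gold.png"
        String strFile = url.substring(url.lastIndexOf("/") + 1);

        LOG.info("Saving " + url + " to " + IMG_FOLDER + strFile);
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, Paths.get(IMG_FOLDER + strFile));
        }
    }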

IMG_FOLDER is your local directory (e.g. "/home/user/Images/"). Keep the trailing slash, because the file name is appended to it directly.

  

    Saving link to /home/user/Images/Gold.png
    Saving link to /home/user/Images/Silver.png

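IMPORTANT: If two images share the same file name, Files.copy throws FileAlreadyExistsException when it reaches the second one. If that is an issue, add a hash of the URL or an integer counter to the file name to tell them apart.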
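As a rough sketch of the hash idea, the helper below appends the first 8 hex characters of the URL's SHA-256 digest to the file name. The class name UniqueImageNames, its methods, and the second URL in main are made up for illustration; they are not part of jsoup or of the code above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of jsoup): builds a file name that stays
// distinct even when two different URLs end in the same image name.
final class UniqueImageNames {

    // Returns base name + "-" + 8 hex chars of the URL's SHA-256 + extension.
    static String uniqueFileName(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        return base + "-" + shortHash(url) + ext;
    }

    // First 4 bytes of the SHA-256 digest, rendered as 8 lowercase hex characters.
    static String shortHash(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same image name, different folders (the second URL is an invented example).
        System.out.println(uniqueFileName("http://m.org/commons/thumb/f/Gold.png"));
        System.out.println(uniqueFileName("http://m.org/commons/thumb/g/Gold.png"));
    }
}
```

In the download loop, strFile could then be computed as UniqueImageNames.uniqueFileName(url) instead of the plain substring. A counter is simpler, but the hash has the advantage that the same URL always maps to the same file name.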

This covers the common case, where the image is referenced by the src attribute of an <img> element. In some cases the URL sits in another attribute, such as data-image-url, or the image is embedded directly in the HTML as a base64 data URI:

<img width="16" height="16" alt="estrela" src="data:image/gif;base64,R0lGODlhEAAQAMQAAORHHOVSKudfOulrSOp3WOyDZu6QdvCchPGolfO0o/XBs/fNwfjZ0frl3/zy7////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAkAABAALAAAAAAQABAAAAVVICSOZGlCQAosJ6mu7fiyZeKqNKToQGDsM8hBADgUXoGAiqhSvp5QAnQKGIgUhwFUYLCVDFCrKUE1lBavAViFIDlTImbKC5Gm2hB0SlBCBMQiB0UjIQA7" />

But the principle is the same.
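For the base64 case there is nothing to download, since the bytes are already in the src value. A minimal sketch, using only java.util.Base64 and assuming the value has the form data:<mime type>;base64,<data>, could look like this. The class DataUriImages and its methods are hypothetical names, and the short payload in main is just the GIF89a signature, not a complete image.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: handles src values of the form "data:<mime>;base64,<data>".
final class DataUriImages {

    // True when src looks like a base64-encoded data URI.
    static boolean isBase64DataUri(String src) {
        return src.startsWith("data:") && src.contains(";base64,");
    }

    // Decodes everything after the first comma into raw image bytes.
    static byte[] decode(String src) {
        if (!isBase64DataUri(src)) {
            throw new IllegalArgumentException("Not a base64 data URI");
        }
        String payload = src.substring(src.indexOf(',') + 1);
        return Base64.getDecoder().decode(payload);
    }

    // Takes the MIME subtype as the file extension: "data:image/gif;base64,..." gives "gif".
    static String extension(String src) {
        return src.substring(src.indexOf('/') + 1, src.indexOf(';'));
    }

    public static void main(String[] args) {
        // "R0lGODlh" is the base64 encoding of the 6-byte signature "GIF89a".
        String src = "data:image/gif;base64,R0lGODlh";
        byte[] bytes = decode(src);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
        System.out.println(extension(src));
    }
}
```

In the loop, you would check DataUriImages.isBase64DataUri(url) before calling new URL(url), and write the decoded bytes with Files.write(Paths.get(IMG_FOLDER + someName), bytes), where someName is a file name you choose yourself because a data URI carries none.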

    
18.03.2017 / 20:00