Generate and download links programmatically

Question

score 5

The National Water Agency (ANA) maintains a database that can be accessed through Hidroweb .

To download data, go to:

Hydrological data > Historical series

and enter the code of the desired rain gauge.

Once you have chosen the station code and type, the site takes you to a page where you can download the data as MS Access or text .

Note that this link is the same for every station code; only the Codigo="" part varies, which is where the code of each station you want to download goes.
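
To illustrate the point, here is a minimal sketch of building the station page URL when only the station code changes. The base URL and the CriaArq and TipoArq parameters are copied from the answers on this page, not from any site documentation, and build_station_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL as used in the answers on this page (an assumption; the site may have changed).
BASE_URL = "http://hidroweb.ana.gov.br/Estacao.asp"


def build_station_url(code, file_type=1):
    """Return the station page URL; only Codigo varies between stations."""
    query = urlencode({"Codigo": code, "CriaArq": "true", "TipoArq": file_type})
    return BASE_URL + "?" + query


if __name__ == "__main__":
    for code in ["2851050", "2751025"]:
        print(build_station_url(code))
```

Passing the codes as strings keeps any leading zeros intact.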

Since I have hundreds of stations to download, doing it one by one is very time-consuming. I would like to download them in a loop .

However, the link mentioned above cannot be passed to the download.file() function, because it does not lead directly to the download. The link that does is generated only after you choose the file type (MS Access or .txt) and a "Click here" link appears. It looks like http://hidroweb.ana.gov.br/ARQ/A20150425-173906-352/CHUVAS.ZIP , where A20150425-173906 is the date and time at which the link was accessed (I do not know what -352 means).
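
Because the generated link embeds an access timestamp, it cannot be guessed in advance and has to be read from the page. As a rough sketch of the link structure described above, the snippet below splits such a link into its parts. The regular expression and the name parse_arq_link are my own assumptions based on the single example link; the meaning of the trailing number remains unknown, so it is returned as plain text.

```python
import re
from datetime import datetime

# Pattern for paths like ARQ/A20150425-173906-352/CHUVAS.ZIP (inferred from one example).
ARQ_PATTERN = re.compile(r"ARQ/A(\d{8})-(\d{6})-(\d+)/([^/\s\"']+)")


def parse_arq_link(link):
    """Split a generated link into timestamp, numeric suffix and file name."""
    match = ARQ_PATTERN.search(link)
    if match is None:
        return None
    date_part, time_part, suffix, filename = match.groups()
    stamp = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return stamp, suffix, filename


if __name__ == "__main__":
    print(parse_arq_link("http://hidroweb.ana.gov.br/ARQ/A20150425-173906-352/CHUVAS.ZIP"))
```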

Does anyone know how I could do this download with R code?

r

asked by anonymous 25.04.2015 / 22:57

4 answers

score 3

I've implemented something in Python that may help you. It downloads each file and names it after the station number. This does not answer the question as asked. It would help if you provided an example, so that someone might solve your problem in R.

# hidroweb.py
# -*- coding: utf-8 -*-

# pip install beautifulsoup4
# pip install requests

import requests
import re
import shutil
from bs4 import BeautifulSoup


class Hidroweb(object):

    url_estacao = 'http://hidroweb.ana.gov.br/Estacao.asp?Codigo={0}&CriaArq=true&TipoArq={1}'
    url_arquivo = 'http://hidroweb.ana.gov.br/{0}'

    def __init__(self, estacoes):
        self.estacoes = estacoes

    def montar_url_estacao(self, estacao, tipo=1):
        return self.url_estacao.format(estacao, tipo)

    def montar_url_arquivo(self, caminho):
        return self.url_arquivo.format(caminho)

    def montar_nome_arquivo(self, estacao):
        return u'{0}.zip'.format(estacao)

    def salvar_arquivo_texto(self, estacao, link):
        # Stream the ZIP to disk instead of loading it all into memory.
        r = requests.get(self.montar_url_arquivo(link), stream=True)
        if r.status_code == 200:
            with open(self.montar_nome_arquivo(estacao), 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            print('** %s ** (baixado)' % (estacao, ))
        else:
            print('** %s ** (problema)' % (estacao, ))

    def obter_link_arquivo(self, response):
        # The generated download link is the anchor whose href starts with "ARQ/".
        soup = BeautifulSoup(response.content, 'html.parser')
        tag = soup.find('a', href=re.compile('^ARQ/'))
        return tag['href'] if tag else None

    def executar(self):
        # '10' requests rainfall data.
        post_data = {'cboTipoReg': '10'}

        for est in self.estacoes:
            print('** %s **' % (est, ))
            r = requests.post(self.montar_url_estacao(est), data=post_data)
            link = self.obter_link_arquivo(r)
            if link is None:
                # No download link in the response (e.g. unknown station code).
                print('** %s ** (problema)' % (est, ))
                continue
            self.salvar_arquivo_texto(est, link)
            print('** %s ** (concluído)' % (est, ))

if __name__ == '__main__':
    estacoes = ['2851050', '2751025', '2849035', '2750004', '2650032',
                '2850015', ]
    hid = Hidroweb(estacoes)
    hid.executar()

# output
# ** 2851050 **
# ** 2851050 ** (baixado)
# ** 2851050 ** (concluído)
# ** 2751025 **
# ** 2751025 ** (baixado)
# ** 2751025 ** (concluído)
# ** 2849035 **
# ** 2849035 ** (baixado)
# ** 2849035 ** (concluído)
# ** 2750004 **
# ** 2750004 ** (baixado)
# ** 2750004 ** (concluído)
# ** 2650032 **
# ** 2650032 ** (baixado)
# ** 2650032 ** (concluído)
# ** 2850015 **
# ** 2850015 ** (baixado)
# ** 2850015 ** (concluído)
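
The key step in the script above is finding the anchor whose href starts with ARQ/. For anyone who wants to try that step offline, here is a small sketch using only the standard library instead of BeautifulSoup. The HTML fragment is made up for illustration (I have not copied it from the site), and ArqLinkFinder and find_arq_link are hypothetical names.

```python
from html.parser import HTMLParser


class ArqLinkFinder(HTMLParser):
    """Collect href values of <a> tags that start with 'ARQ/'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("ARQ/"):
            self.links.append(href)


def find_arq_link(html):
    """Return the first ARQ/ link in the page, or None if there is none."""
    finder = ArqLinkFinder()
    finder.feed(html)
    return finder.links[0] if finder.links else None


# Hypothetical page fragment, made up for illustration only.
SAMPLE_HTML = (
    '<p><a href="Estacao.asp">Voltar</a> '
    '<a href="ARQ/A20150425-173906-352/CHUVAS.ZIP">Clique aqui</a></p>'
)

if __name__ == "__main__":
    print(find_arq_link(SAMPLE_HTML))
    print(find_arq_link("<p>no link here</p>"))
```

Returning None when no link is found lets the caller skip a station instead of crashing on it.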


26.04.2015 / 02:26

score 2

I would like to add that in both the R and the Python code, a small change is needed to download flow and other data.

In R, change this:

  list(cboTipoReg = "10")

In Python:

  post_data = {'cboTipoReg': '10'}

The hardcoded 10 is only for rainfall ( CHUVAS.ZIP ). For other data, use the values below. Note that the R code also hardcodes the rainfall file name in its pattern "ARQ.+/CHUVAS.ZIP", so that pattern presumably needs to be adjusted as well.

value="8" for Water Levels (cm)

value="9" for Flows (m³/s)

value="12" for Water Quality

value="13" for Discharge Summary

value="16" for Cross-Section Profile
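
As a sketch of how the hardcoded value could be made selectable, the values listed above can be put in a lookup table. The dictionary keys and the function name build_post_data are labels I chose for this example, not names used by the site; only the numeric values come from the list above.

```python
# Values of the cboTipoReg form field as listed above.
# The keys are labels chosen for this sketch, not names used by the site.
TIPO_REG = {
    "water_level": "8",     # listed as value 8 (cm)
    "flow": "9",            # listed as value 9 (m3/s)
    "rainfall": "10",       # the value hardcoded in the scripts
    "water_quality": "12",  # listed as value 12
    "summary": "13",        # listed as value 13
    "cross_profile": "16",  # listed as value 16
}


def build_post_data(data_type):
    """Return the form payload for the requested data type."""
    if data_type not in TIPO_REG:
        raise ValueError("unknown data type: %r" % (data_type,))
    return {"cboTipoReg": TIPO_REG[data_type]}


if __name__ == "__main__":
    print(build_post_data("rainfall"))
    print(build_post_data("flow"))
```

The result can then be passed where the scripts currently use post_data.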

16.11.2015 / 08:49

score 1

I've added a few steps to Arthur's script, such as changing the working directory and moving the station list to an external file. I apologize for any gaffes, I'm a beginner. I hope this helps.

"""
___________________________-> Python - Hidroweb <-______________________________

Author: Arthur Alvin 25/04/2015
[email protected]

Modified by: Jean Favaretto 16/07/2015
[email protected]

Modified by: Vitor Gustavo Geller 16/07/2015
[email protected]

______________________________-> Comments <-_________________________________

The Python HidroWeb script was created to automate the acquisition of
station data from the portal: http://hidroweb.ana.gov.br/

To use the script, the following libraries must be installed:
-> requests
-> beautifulsoup4 (or later)

USAGE:

After installing the libraries, create an Input File with the station
numbers. Then start the script; it opens a window for selecting the Input
File. HidroWeb - Python produces two outputs. The first is printed on
screen and shows the status of each download. The second is the set of
files, written to the same directory as the Input File, one for each
station that could be downloaded.


INPUT FILE:

The input must be a *.txt file containing the numbers of the stations to
download, structured as follows:
-> The station numbers must be entered one per line, with no headers,
no spaces, and no separators (, . ;).
-> Simply press Enter after each station number.

02751025
02849035
02750004
02650032
02850015


OUTPUTS:

Download status on screen:
** 02851050 **
** 02851050 ** (baixado)
** 02851050 ** (concluído)

The output files, containing the information available for each downloaded
station, are created in the directory of the Input File.

NOTE: Make sure every station number exists; otherwise you get a
"BuuuG".
Keywords: HidroWeb, ANA, Estacoes, Pluviometricas, Fluviometricas,
Precipitacao, Vazao, Cotas, baixar, download.
"""

# ********  INITIAL DECLARATIONS
import os
import sys
import re
import shutil
import tkinter
from tkinter import filedialog

import requests
from bs4 import BeautifulSoup

# By Vitor

# OPEN THE INPUT FILE
root    = tkinter.Tk()
entrada = filedialog.askopenfile(mode='r')
root.destroy()

#****************---------------bug fix--------------********************
if entrada is None:
    input('\tArquivo de entrada nao selecionado. \n\t\tPressione enter para sair.\n')
    sys.exit()
#****************---------------end of fix--------------********************

pathname = os.path.dirname(entrada.name)  # working directory = directory of the input file
os.chdir(pathname)  # change the working directory

# By Jean

# Read the file once: one station code per line, ignoring blank lines.
VALORES = [linha.strip() for linha in entrada.read().split("\n") if linha.strip()]
entrada.close()

print(VALORES, "\n")


#### By Arthur

class Hidroweb(object):

    url_estacao = 'http://hidroweb.ana.gov.br/Estacao.asp?Codigo={0}&CriaArq=true&TipoArq={1}'
    url_arquivo = 'http://hidroweb.ana.gov.br/{0}'

    def __init__(self, estacoes):
        self.estacoes = estacoes

    def montar_url_estacao(self, estacao, tipo=1):
        return self.url_estacao.format(estacao, tipo)

    def montar_url_arquivo(self, caminho):
        return self.url_arquivo.format(caminho)

    def montar_nome_arquivo(self, estacao):
        return u'{0}.zip'.format(estacao)

    def salvar_arquivo_texto(self, estacao, link):
        # Stream the ZIP to disk instead of loading it all into memory.
        r = requests.get(self.montar_url_arquivo(link), stream=True)
        if r.status_code == 200:
            with open(self.montar_nome_arquivo(estacao), 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
            print('** %s ** (baixado)' % (estacao, ))
        else:
            print('** %s ** (problema)' % (estacao, ))

    def obter_link_arquivo(self, response):
        # The generated download link is the anchor whose href starts with "ARQ/".
        soup = BeautifulSoup(response.content, 'html.parser')
        tag = soup.find('a', href=re.compile('^ARQ/'))
        return tag['href'] if tag else None

    def executar(self):
        # '10' requests rainfall data.
        post_data = {'cboTipoReg': '10'}

        for est in self.estacoes:
            print('** %s **' % (est, ))
            r = requests.post(self.montar_url_estacao(est), data=post_data)
            link = self.obter_link_arquivo(r)
            if link is None:
                # No download link in the response (e.g. unknown station code).
                print('** %s ** (problema)' % (est, ))
                continue
            self.salvar_arquivo_texto(est, link)
            print('** %s ** (concluído)' % (est, ))

if __name__ == '__main__':
    estacoes = VALORES
    hid = Hidroweb(estacoes)
    hid.executar()
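
The input format described in the docstring (one station number per line) can be parsed without any dialog or network access. Below is a minimal sketch of just that step; parse_station_codes is a hypothetical helper, and skipping blank lines is my own choice so that a trailing Enter at the end of the file does not produce an empty station code.

```python
def parse_station_codes(text):
    """Return station codes, one per non-empty line, as strings."""
    codes = []
    for line in text.splitlines():
        code = line.strip()
        if code:
            codes.append(code)
    return codes


if __name__ == "__main__":
    sample = "02751025\n02849035\n\n02750004\n"
    print(parse_station_codes(sample))
```

Keeping the codes as strings preserves leading zeros such as the one in 02751025.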

17.07.2015 / 03:25


score 6 · Accepted Answer

Here is an answer in R. You will need the packages httr and XML :

install.packages("httr")
install.packages("XML")

I wrote the code in the simplest way, without creating functions or adding parameters other than the station code, but from this the rest should be easy to do. As in Arthur Alvim's answer, the files are saved under the station name in the current R working directory.

library(httr)
library(XML)

baseurl <-c("http://hidroweb.ana.gov.br/Estacao.asp?Codigo=", "&CriaArq=true&TipoArq=1")

estacoes <- c(2851050, 2751025, 2849035, 2750004, 2650032, 2850015, 123)

for (est in estacoes){
  # POST the form; cboTipoReg = "10" requests rainfall data
  r <- POST(url = paste0(baseurl[1], est, baseurl[2]), body = list(cboTipoReg = "10"), encode = "form")
  # Look for the generated download path in the returned page
  cont <- if (r$status_code == 200) content(r, as = "text") else ""
  arquivo <- unlist(regmatches(cont, gregexpr("ARQ.+/CHUVAS.ZIP", cont)))
  if (length(arquivo) > 0) {
    arq.url <- paste0("http://hidroweb.ana.gov.br/", arquivo[1])
    download.file(arq.url, paste0(est, ".zip"), mode = "wb")
    cat("Arquivo", est, "salvo com sucesso.\n")
  } else {
    # Request failed or no link was found (e.g. an invalid station code)
    cat("Erro no arquivo", est, "\n")
  }
}

# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005606-786/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6532 bytes
# downloaded 6532 bytes
# 
# Arquivo 2851050 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005607-172/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6734 bytes
# downloaded 6734 bytes
# 
# Arquivo 2751025 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005608-703/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 6737 bytes
# downloaded 6737 bytes
# 
# Arquivo 2849035 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005609-783/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 3995 bytes
# downloaded 3995 bytes
# 
# Arquivo 2750004 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005610-492/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 10751 bytes (10 KB)
# downloaded 10 KB
# 
# Arquivo 2650032 salvo com sucesso.
# trying URL 'http://hidroweb.ana.gov.br/ARQ/A20150910-005613-538/CHUVAS.ZIP'
# Content type 'application/x-zip-compressed' length 4625 bytes
# downloaded 4625 bytes