Web scraping with Selenium + Python on a site whose content is generated dynamically by JS: difficulty locating elements

Good afternoon. I'm developing a script that:

  • accesses a system;
  • finds certain information within that environment;
  • generates a kind of report;
  • creates a spreadsheet with the data.

My problem comes before the parsing stage. I can reach the environment that contains the information, but I cannot get the Selenium webdriver to locate the elements it has to click to reach the data that will go into the report.

I suspect JavaScript is the cause: the frame that triggers the JavaScript is accessible, but the resulting page, which I can see in the browser, does not seem to be visible to the script.

How can I work around the JavaScript?

How can I make the webdriver "see" the final page the same way I see it?

(EDITED: code below)

    import os
    import time

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchFrameException
    from selenium.webdriver.common.by import By

    # Raw string: otherwise the '\r' in '\relatorio.csv' is read as a carriage return.
    PASTA = r'c:\projudi'
    ARQUIVO = os.path.join(PASTA, 'relatorio.csv')

    os.makedirs(PASTA, exist_ok=True)

    try:
        planilha = open(ARQUIVO, 'r+')
    except FileNotFoundError:
        planilha = open(ARQUIVO, 'w+')


    def enter_frames(browser):
        # The content lives inside two nested frames.
        browser.switch_to.frame('mainFrame')
        browser.switch_to.frame('userMainFrame')


    browser = webdriver.Chrome()
    browser.get('https://projudi.tjpr.jus.br/projudi')
    time.sleep(20)  # crude fixed wait for the page and its frames
    enter_frames(browser)
    links = browser.find_elements(By.CLASS_NAME, 'link')

    for i in range(0, len(links), 2):
        if links[i].text != '0':
            links[i].click()
            time.sleep(2)
            try:
                enter_frames(browser)
            except NoSuchFrameException:
                pass  # frames not found: search in the current context
            a = browser.find_elements(By.CLASS_NAME, 'link')
            if a:
                tabelas = browser.find_elements(By.CLASS_NAME, 'resultTable')
                dados = tabelas[0].text.split('\n')
                # One table cell per line in the report file.
                planilha.writelines(linha + '\n' for linha in dados)
                for j in range(len(a)):
                    a[j].click()
                    time.sleep(2)
                    browser.back()
                    time.sleep(2)
                    enter_frames(browser)
                    # Locate the links again after navigating back.
                    a = browser.find_elements(By.CLASS_NAME, 'link')
                browser.back()
                time.sleep(2)
            else:
                browser.back()
                time.sleep(2)
            enter_frames(browser)
            links = browser.find_elements(By.CLASS_NAME, 'link')

    planilha.close()
    browser.close()
    

My question: when I reach the screen that contains the information I need (resultTable), I grab all of it into a single string holding all the data. I split that string and get a list of strings. So far, so good: I write it all to the report file for further processing. Now, how do I control the FLOW? I already know I will have to apply a regex to the DATE strings in the list, since I only need the entries dated from today back to 2 days ago. But how do I use that information as a REFERENCE in Python? Example: the script captures the table and puts it into a list like this:

      

    dados = ['0004434-48.2010', 'UNITY', '(30 working days) 07/03/2017', '13/07/2017',
             '0008767-77.2013', '2017', '(10 business days) 07/03/2017', '13/07/2017']

The first item in the list is the first cell of the table (row 1, column 1); it contains the link. The control date is in the THIRD item (row 1, column 3). Item 5 is already the next row (row 2, column 1). I'm not sure I explained that well! =/

What I need:

  • Check the date.
  • If it is today or yesterday, click the first item in that row.
  • If it is not, move on to the next row.

        
    asked by anonymous 28.06.2017 / 21:23

    1 answer

I'm not sure I understood exactly what you want to do, but Selenium has several modules aimed at this kind of task. The catch is that you need to inspect the page's HTML and work out which element is which before you can capture it with Selenium.

    from selenium.webdriver.common.keys import Keys         # special keyboard keys (ENTER, TAB, ...) for send_keys
    from selenium.webdriver.support.ui import Select        # helper for <select> dropdown elements
    from selenium.webdriver.common.by import By             # locator strategies (By.ID, By.NAME, ...)
    from selenium.webdriver.support.ui import WebDriverWait # explicit waits with a timeout
    from selenium.webdriver.support import expected_conditions as EC # ready-made conditions to wait for
    

Those are some useful Selenium modules. To check whether the date is today's or the next day's, I would recommend inspecting the element's ID or name and locating it with a command such as:

    variavel = driver.find_element(By.NAME, 'elemento')
    

If you have already captured the information and stored it in a file or a variable, I suggest using Pandas to organize it as DataFrames.
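As a rough sketch of that idea, the flat list from the question can be cut into rows of four cells and loaded into a DataFrame. The column names below are hypothetical (the question does not name the columns), and the four-column width is inferred from the question's description of the table.

```python
import pandas as pd

# Hypothetical column names: the question only says the table has 4 columns.
COLUMNS = ["case_number", "party", "deadline", "due_date"]


def table_to_frame(cells, columns=COLUMNS):
    """Cut a flat list of cell texts into rows and build a DataFrame."""
    width = len(columns)
    rows = [
        cells[i:i + width]
        for i in range(0, len(cells) - width + 1, width)
    ]
    return pd.DataFrame(rows, columns=columns)


cells = [
    "0004434-48.2010", "UNITY", "(30 working days) 07/03/2017", "13/07/2017",
    "0008767-77.2013", "2017", "(10 business days) 07/03/2017", "13/07/2017",
]
frame = table_to_frame(cells)
print(frame.to_csv(index=False))
```

Passing a file path to `to_csv` instead of printing would write the spreadsheet directly.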

To check the date attached to a link, I would fetch the element with find_element, work out at which character the date starts and ends in its text (link[n:m]), and then use datetime to compare that date with the current one.

To get the current date:

    from datetime import datetime, timedelta

    n = 2  # how many days to look back (example value)
    data_hoje = datetime.now().strftime("%d%m%Y")
    data_ontem = (datetime.now() - timedelta(days=1)).strftime("%d%m%Y")
    data_um_dia_n_dias_atras = (datetime.now() - timedelta(days=n)).strftime("%d%m%Y")
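Putting these pieces together for the table in the question, here is a minimal sketch of the flow control. It compares date objects rather than formatted strings, which is a swap from the snippet above. It assumes four cells per row, the control date in the third cell, and dates written as dd/mm/yyyy; the function and parameter names are hypothetical.

```python
import re
from datetime import date, datetime, timedelta

DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def rows_to_open(cells, today, columns=4, date_column=2, max_age_days=1):
    """Return the row numbers whose control date is at most max_age_days old."""
    selected = []
    for start in range(0, len(cells) - columns + 1, columns):
        row = cells[start:start + columns]
        match = DATE_RE.search(row[date_column])
        if match is None:
            continue  # no date in this row: skip it
        # Assumption: dates are written day first (dd/mm/yyyy).
        control = datetime.strptime(match.group(), "%d/%m/%Y").date()
        age = today - control
        if timedelta(0) <= age <= timedelta(days=max_age_days):
            selected.append(start // columns)
    return selected


cells = [
    "0004434-48.2010", "UNITY", "(30 working days) 07/03/2017", "13/07/2017",
    "0008767-77.2013", "2017", "(10 business days) 07/03/2017", "13/07/2017",
]
print(rows_to_open(cells, today=date(2017, 3, 8)))  # prints [0, 1]
```

In the real script `today` would be `date.today()`. Each returned row number could then be used to pick the link element to click, assuming there is one link per row, which the question does not confirm.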
    
        
    10.08.2018 / 19:15