I have the following situation:
Iwanttocollect"Crawler Text" below, how do I navigate there without a class or id?
<td>Texto para crawler</td>
You can use the BeautifulSoup library next to the .find_all()
to extract all td
tags from a site, without specifying any class
, name
or id
, for example.
Code:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
html_page = requests.get(url)
html_source = html_page.text
soup = BeautifulSoup(html_source, 'html.parser')
td_tags = soup.find_all('td')
for td in td_tags:
print(td, '\n')
Output:
<td class="mbox-image"><div style="width:52px"><img alt="Globe icon." data-file-height="290" data-file-width="350" height="40" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/73px-Ambox_globe_content.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/97px-Ambox_globe_content.svg.png 2x" width="48"/></div></td>
<td class="mbox-text"><span class="mbox-text-span">The examples and perspective in this article <b>deal primarily with the United States and do not represent a <a href="/wiki/Wikipedia:WikiProject_Countering_systemic_bias" title="Wikipedia:WikiProject Countering systemic bias">worldwide view</a> of the subject</b>. <span class="hide-when-compact">You may <a class="external text" href="//en.wikipedia.org/w/index.php?title=Web_scraping&action=edit">improve this article</a>, discuss the issue on the <a href="/wiki/Talk:Web_scraping" title="Talk:Web scraping">talk page</a>, or <a href="/wiki/Wikipedia:Article_wizard" title="Wikipedia:Article wizard">create a new article</a>, as appropriate.</span> <small><i>(October 2015)</i></small> <small class="hide-when-compact"><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove this template message</a>)</i></small></span></td>