Extracting colored text from a table with BeautifulSoup

Posted 2024-03-29 11:57:35


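I'm new to Python and fairly new to programming in general. I'm trying to write a script that uses BeautifulSoup to parse https://www.state.nj.us/mvc/ for any red text. The table I'm looking at is relatively simple HTML: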

<html>
 <body>
  <div class="alert alert-warning alert-dismissable" role="alert">
   <div class="table-responsive">
    <table class="table table-sm" align="center" cellpadding="0" cellspacing="0">
     <tbody>
      <tr>
       <td width="24%">
        <strong>
         <font color="red">Bakers Basin</font>
        </strong>
       </td>
       <td width="24%">
        <strong>Oakland</strong>
       </td>
 ...
 ...
 ...
      </tr>
     </tbody>
    </table>
   </div>
  </div>
 </body>
</html>

From the above I want to find Bakers Basin, for example, but not Oakland.

Here is the Python I wrote (adapted from The Self-Taught Programmer by Cory Althoff, 2017, Triangle Connection LLC):

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)
        tabledmv = soup.find_all("font color=\"red\"")
        for tag in tabledmv:
            print("\n" + tabledmv.get_text())


website = "https://www.state.nj.us/mvc/"
Scraper(website).scrape()

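I seem to be missing something here, because I can't get this to scrape the table and return anything useful. The end goal is to add the time module, run this every X minutes, and have it log a message somewhere whenever a location turns red. (This is all so my wife can figure out the least crowded DMV in New Jersey!)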

Any help or guidance on getting the BeautifulSoup part working is much appreciated.


2 Answers

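The table is actually loaded from a separate page, https://www.state.nj.us/mvc/locations/agency.htm, not from the main page.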

To get only the red text, you can use the CSS selector soup.select('font[color="red"]'), as @Mr.Polywhill mentioned:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)
        tabledmv = soup.select('font[color="red"]')[1:]
        for tag in tabledmv:
            print(tag.get_text())


website = "https://www.state.nj.us/mvc/locations/agency.htm"
Scraper(website).scrape()
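
For context on why the original attempt printed nothing: find_all treats its first string argument as a tag name, so "font color=\"red\"" matches no tag at all, and the attribute has to be passed separately (or through a CSS selector as above). The loop in the question also calls get_text() on the result list rather than on each tag. A minimal sketch on an inline snippet modeled on the question's HTML; the sample_html string and the red_text helper are made up for illustration and nothing is fetched from the live site:

```python
from bs4 import BeautifulSoup

# Inline sample modeled on the question's table; not fetched from the live site.
sample_html = """
<table>
 <tr>
  <td><strong><font color="red">Bakers Basin</font></strong></td>
  <td><strong>Oakland</strong></td>
 </tr>
</table>
"""


def red_text(html):
    """Return the text of every <font color="red"> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag name and attribute filter are separate arguments to find_all.
    tags = soup.find_all("font", attrs={"color": "red"})
    # Call get_text() on each tag, not on the result list.
    return [tag.get_text(strip=True) for tag in tags]


print(red_text(sample_html))  # ['Bakers Basin']
```

On this sample, soup.select('font[color="red"]') should pick out the same single element, so either spelling works.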

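The data is loaded from a different location, in this case 'https://www.state.nj.us/mvc/locations/agency.htm'. To get each town together with its column heading, you can use the following example: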

import requests
from bs4 import BeautifulSoup


url = 'https://www.state.nj.us/mvc/locations/agency.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Select every <td> that has a <font> descendant
for t in soup.select('td:has(font)'):
    # Position of this cell among the <td> cells of its row
    i = t.find_previous('tr').select('td').index(t)
    # The first two columns are labeled Licensing Centers, the rest Vehicle Centers
    if i < 2:
        print('{:<20} {}'.format(' '.join(t.text.split()), 'Licensing Centers'))
    else:
        print('{:<20} {}'.format(' '.join(t.text.split()), 'Vehicle Centers'))

Prints:

Bakers Basin         Licensing Centers
Cherry Hill          Vehicle Centers
Springfield          Vehicle Centers
Bayonne              Licensing Centers
Paterson             Licensing Centers
East Orange          Vehicle Centers
Trenton              Vehicle Centers
Rahway               Licensing Centers
Hazlet               Vehicle Centers
Turnersville         Vehicle Centers
Jersey City          Vehicle Centers
Wallington           Vehicle Centers
Delanco              Licensing Centers
Lakewood             Vehicle Centers
Washington           Vehicle Centers
Eatontown            Licensing Centers
Edison               Licensing Centers
Toms River           Licensing Centers
Newton               Vehicle Centers
Freehold             Licensing Centers
Runnemede            Vehicle Centers
Newark               Licensing Centers
S. Brunswick         Vehicle Centers
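
The question also asks about re-running the check every X minutes and logging when a location turns red. Below is a rough sketch of that loop using only the standard library. It assumes you already have some function that returns the currently red town names, for instance one built on the font[color="red"] selector. The names watch and get_red_towns and all the parameters are hypothetical, not part of any library:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def watch(get_red_towns, interval_seconds, max_rounds, sleep=time.sleep):
    """Poll get_red_towns() and log each town when it newly turns red.

    get_red_towns is any callable returning an iterable of town names.
    Returns the list of (round_number, town) events that were logged.
    """
    previous = set()
    events = []
    for round_number in range(max_rounds):
        current = set(get_red_towns())
        # Only report towns that were not red on the previous check.
        for town in sorted(current - previous):
            logging.info("%s turned red", town)
            events.append((round_number, town))
        previous = current
        # Wait between checks, but not after the final one.
        if round_number < max_rounds - 1:
            sleep(interval_seconds)
    return events
```

Something like watch(my_scraper, interval_seconds=600, max_rounds=6) would check every ten minutes for about an hour, where my_scraper stands for your own scraping function. The sleep parameter is only there so the loop can be exercised without actually waiting.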
