How can I translate this XPath expression to BeautifulSoup?

Asked 2025-04-15 16:29

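While answering an earlier question of mine, several people suggested that I use BeautifulSoup for my project. I have been struggling with their documentation, but I just cannot make sense of it. Can somebody point me to the section that would let me translate this expression into a BeautifulSoup expression?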

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

The expression above is from Scrapy. I am trying to apply the regex re('/.a\w+') to the td with class altRow in order to get the links from there.

I would also appreciate pointers to any other tutorials or documentation. I could not find any.

Thanks for your help.

Edit: I am looking at this page:

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

However, if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

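For some reason, the search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".

Edit: cobbal: It is still not working. But when I search for this: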

>>>soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

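it returns all the links that have "a" as the second character, but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.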
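To check what the pattern itself accepts, independent of any parser, it can be run by hand against a few of the hrefs. This is only a rough sketch: it assumes BeautifulSoup tests a compiled pattern with `search` (so the match may start anywhere in the attribute value), and the `hrefs` list is a small sample copied from the output and page source above.

```python
import re

# Same pattern as in the question: "/", any one character, "a", then word characters.
pattern = re.compile(r'/.a\w+')

# Sample hrefs copied from the output and the page source shown above.
hrefs = [
    "/FCWSite/Include/styles/main.css",
    "/careers/northamerica",
    "/diversity/manager",
    "/diversity/committee",
    "/cabel",
]

# Map each href to the matched substring, or None when nothing matches.
matches = {}
for href in hrefs:
    m = pattern.search(href)  # search() scans the whole string, not just the start
    matches[href] = m.group(0) if m else None
    print(href, "->", matches[href])
```

Under that assumption the pattern accepts any href containing a path segment whose second character is "a", which is why the stylesheet link matches at "/main". "/cabel" also satisfies the pattern, so its absence from the results points at the parsed tree rather than at the regex.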

4 Answers

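I just answered this on the Beautiful Soup mailing list, in response to Zeynel's email to the list. In short, there is an error in the web page that completely kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

The thread is available in the Google Groups archive.

Answered 2025-04-15 by Python大师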

One option is to use lxml (I am not familiar with BeautifulSoup, so I can't say how to do it with that); lxml supports XPath by default.
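A minimal sketch of the lxml route, run on made-up markup rather than the real page. The `SAMPLE` table and its row layout are assumptions for illustration, and the `findall` step only approximates what Scrapy's `.re()` does with the extracted strings.

```python
import re
from lxml import html as lxml_html

# Hypothetical sample markup; the real page layout may differ.
SAMPLE = """
<html><body><table>
  <tr>
    <td class="altRow">1</td>
    <td class="altRow" valign="middle" width="34%"><a href='/cabel'>Abel, Christian</a></td>
  </tr>
  <tr>
    <td class="altRow">2</td>
    <td class="altRow"><a href="/diversity/committee">Committee</a></td>
  </tr>
</table></body></html>
"""

tree = lxml_html.fromstring(SAMPLE)

# The XPath from the question, unchanged: the href of each link that sits
# directly inside the second altRow cell of its parent row.
hrefs = tree.xpath('//td[@class="altRow"][2]/a/@href')

# Approximate Scrapy's .re() by collecting every regex match per href.
pattern = re.compile(r'/.a\w+')
links = [match for href in hrefs for match in pattern.findall(href)]
print(links)
```

The XPath runs as written, so no translation step is needed; only the regex filter has to be applied separately.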

Edit:
Try this (tested):

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

soup should be a BeautifulSoup object:

import re
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
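One caveat: in XPath, `//td[@class="altRow"][2]` selects the second matching cell under each parent, whereas `findAll('td', 'altRow')[1]` takes the second matching cell in the whole document. A per-row version might look like the sketch below. It assumes the newer `bs4` package (Beautiful Soup 4) rather than the BeautifulSoup 3 module imported above, and the `SAMPLE` markup is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
SAMPLE = """
<table>
  <tr>
    <td class="altRow">1</td>
    <td class="altRow" valign="middle" width="34%"><a href='/cabel'>Abel, Christian</a></td>
  </tr>
  <tr>
    <td class="altRow">2</td>
    <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
  </tr>
  <tr>
    <td class="altRow">3</td>
    <td class="altRow"><a href="/diversity/committee">Committee</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
pattern = re.compile(r'/.a\w+')

links = []
for row in soup.find_all("tr"):
    # Only direct children of the row, mirroring td[@class="altRow"].
    cells = row.find_all("td", class_="altRow", recursive=False)
    if len(cells) >= 2:
        # Second altRow cell of this row, then a direct <a> child whose
        # href contains a match for the pattern.
        link = cells[1].find("a", href=pattern, recursive=False)
        if link is not None:
            links.append(link["href"])
print(links)
```

This only helps if the parser actually keeps the cells in the tree; on a page with broken markup the choice of parser may matter more than the query.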

Answered 2025-04-15 by Python大师

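I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape a few substrings out of some HTML, and pyparsing has some useful methods for doing that. Using this code: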

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova

Answered 2025-04-15 by Python大师
