Python, BeautifulSoup, re: How do I convert text extracted from the web into a dictionary?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183833#null"
page = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
soup = BeautifulSoup(page, 'html.parser')
results = soup.findAll('tr')
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        print(res)

2 Answers

User

Answer 1 · edited 2024-04-25 04:11:11

For the specific example given, this approach works:

...
results = soup.findAll('tr')
my_dict = {}
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        pieces = re.findall(r'[A-Z][ a-z]*', res)
        my_dict[pieces[0]] = pieces[1]
print(my_dict)

Output:

[output not preserved in the source]

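This relies heavily on the exact format shown in the example above. For instance, if the website wrote 'Lycaon Pictus' with a capital 'P' under 'Species', the corresponding dictionary entry would be just 'Lycaon' rather than 'Lycaon Pictus'.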
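To make that fragility concrete, here is a minimal sketch that applies the same re.findall split to a few hand-written strings. The helper name split_rank_and_taxon and the sample strings are hypothetical; they only imitate the fused first line of a matching row and are not taken from the live page.

```python
import re

def split_rank_and_taxon(line):
    """Split a fused line like 'KingdomAnimalia' using the same regex as above."""
    # Each piece starts at a capital letter and runs over lowercase letters and spaces.
    pieces = re.findall(r'[A-Z][ a-z]*', line)
    return pieces[0], pieces[1]

# Hypothetical inputs shaped like the first line of a matching row.
print(split_rank_and_taxon("KingdomAnimalia"))       # ('Kingdom', 'Animalia')
print(split_rank_and_taxon("SpeciesLycaon pictus"))  # ('Species', 'Lycaon pictus')

# A second capital letter starts a new piece, so the species epithet is lost
# (note the trailing space left on the genus).
print(split_rank_and_taxon("SpeciesLycaon Pictus"))  # ('Species', 'Lycaon ')
```

The split has no notion of where the rank name ends; it only works because the rank and the taxon each happen to contain exactly one capital letter.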

User

Answer 2 · edited 2024-04-25 04:11:11

Each "result" looks like this:

<td align="left" class="body" width="2%"> </td>
<td align="left" class="body" valign="top" width="24%">Kingdom</td>
<td class="datafield" valign="top" width="71%"><a href="SingleRpt?search_topic=TSN&amp;search_value=202423">Animalia</a> 
 – Animal, animaux, animals</td>
<td class="body" width="5%"> </td>

When you call .get_text() on it, it becomes:

[output not preserved in the source]

So when the pattern matches, go back to the original "result" and split it into its columns. For example:

if re.match(pattern, text):
    pieces = result.findAll('td')

Then use these pieces to find your information, for example:

for p in pieces:
    print(p.get_text())

Of course, since you are working with strings, you cannot expect to get a dictionary back on its own. Create one before the for loop starts; let's call it dictionary:

if re.match(pattern, text):
    p = result.findAll('td')
    rank = p[1].get_text().strip()
    taxon = p[2].get_text().split('\xa0')[0]
    dictionary[rank] = taxon

This gives you the dictionary you are looking for.
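Putting the pieces together, a minimal offline sketch might look like the following. It parses an inline sample instead of downloading the page, so SAMPLE_HTML, RANKS and rows_to_dict are hypothetical names, and the sample assumes that the taxon link is followed by a non-breaking space ('\xa0') as the split above implies. It also swaps the regex on the whole row text for a set lookup on the second cell, which is a simplification and not the code from the answer.

```python
from bs4 import BeautifulSoup

# Rank names taken from the regex in the question.
RANKS = {"Kingdom", "Phylum", "Division", "Class", "Order", "Family", "Genus", "Species"}

# Inline stand-in for the downloaded page (assumption: real rows share this layout,
# including a non-breaking space right after the taxon link).
SAMPLE_HTML = """
<table>
<tr>
<td align="left" class="body" width="2%"> </td>
<td align="left" class="body" valign="top" width="24%">Kingdom</td>
<td class="datafield" valign="top" width="71%"><a href="SingleRpt?search_topic=TSN&amp;search_value=202423">Animalia</a>\xa0
 – Animal, animaux, animals</td>
<td class="body" width="5%"> </td>
</tr>
</table>
"""

def rows_to_dict(html):
    """Map each taxonomic rank to its taxon, reading the table cell by cell."""
    soup = BeautifulSoup(html, "html.parser")
    dictionary = {}  # created before the loop, filled row by row
    for result in soup.find_all("tr"):
        cells = result.find_all("td")
        if len(cells) < 3:
            continue  # not a rank/taxon row
        rank = cells[1].get_text().strip()
        if rank in RANKS:
            # Keep only the text before the non-breaking space (the scientific name).
            taxon = cells[2].get_text().split("\xa0")[0].strip()
            dictionary[rank] = taxon
    return dictionary

print(rows_to_dict(SAMPLE_HTML))  # {'Kingdom': 'Animalia'}
```

Reading the cells separately avoids guessing where the rank name ends inside a fused string, so capitalization in the taxon no longer matters.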

