How can I automatically scrape a Wikipedia infobox and print its data for any wiki page using Python?

Posted 2024-04-26 22:07:30


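My task is to print Wikipedia infobox data automatically. As an example, I am scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek), extracting the infobox section on the right-hand side, and printing it on the screen line by line with Python. I specifically want the infobox. This is what I have so far: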

from bs4 import BeautifulSoup
import urllib.request
# specify the url
urlpage =  'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)

This gives me everything in the infobox, but as a list of raw <tr> elements that still contain all of their HTML markup.

I only want to extract the data and print it on the screen. So what I want is:

Created by  Gene Roddenberry
Original work   Star Trek: The Original Series
Print publications
Book(s) 
List of reference books
List of technical manuals
Novel(s)    List of novels
Comics  List of comics
Magazine(s) 
Star Trek: The Magazine
Star Trek Magazine 

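And so on, all the way to the end of the infobox. So basically, is there a way to print the infobox data so that I can do it automatically for any wiki page? (On all the wiki pages I have looked at, the infobox table has the class 'infobox vevent', as shown in the code.)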


2 Answers

With BeautifulSoup, you need to reformat the data yourself into the shape you want. Use fresult = [e.text for e in results] to get the text of each result.
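As a minimal sketch of that reformatting step, the snippet below pairs the text of each row's `<th>` cell with the text of its `<td>` cell. The `SAMPLE_HTML` string and the `infobox_rows` helper are hypothetical names introduced here for illustration; the markup is a heavily simplified stand-in for a real infobox, so the example runs without network access. It also assumes each row has at most one label cell and one value cell, which real infoboxes do not always satisfy.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified infobox markup used only for illustration.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Star Trek</th></tr>
  <tr><th>Created by</th><td><a href="#">Gene Roddenberry</a></td></tr>
  <tr><th>Original work</th><td><i>Star Trek: The Original Series</i></td></tr>
  <tr><th>Novel(s)</th><td><a href="#">List of novels</a></td></tr>
</table>
"""


def infobox_rows(html):
    """Return (label, value) pairs for each infobox row.

    The value is '' for rows that only contain a header cell.
    """
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" matches any table whose class list contains "infobox",
    # not only the exact string "infobox vevent".
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        th = tr.find("th")
        td = tr.find("td")
        # get_text(" ", strip=True) drops the tags and trims whitespace.
        label = th.get_text(" ", strip=True) if th else ""
        value = td.get_text(" ", strip=True) if td else ""
        if label or value:
            rows.append((label, value))
    return rows


for label, value in infobox_rows(SAMPLE_HTML):
    print(label, value, sep="\t")
```

To try this on a live page, the HTML returned by urllib.request.urlopen(urlpage) in the question's code could be passed to the helper in place of SAMPLE_HTML. Matching on the single class "infobox" is a design choice made on the assumption that not every page uses the exact "infobox vevent" combination.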

If you want to read a table from an HTML page, you can try something like the following, although it uses pandas rather than BeautifulSoup.

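import pandas
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns a list of DataFrames, one per table; take the first one
data = pandas.read_html(urlpage)[0]
null = data.isnull()

# print the first two columns of each row, using "" for missing values
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")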

This page will help you parse HTML into a plain string without the HTML tags: "Using BeautifulSoup Extract Text without Tags".

Here is the code from that page, credited to @0605002:

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
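
If label/value pairs are more useful than one block of text, the same idea can be taken a step further with `stripped_strings`, which yields each text node with its surrounding whitespace removed. The following is only a sketch on a shortened copy of the fragment above, and it assumes the text nodes strictly alternate between label and value.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment above.
html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")
# stripped_strings skips whitespace-only nodes and trims the rest.
parts = list(soup.stripped_strings)
# Pair each label with the value that follows it (assumes strict alternation).
pairs = list(zip(parts[0::2], parts[1::2]))
for label, value in pairs:
    print(label, value)
```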
