How can I automatically scrape the Wikipedia infobox and print its data for any wiki page using Python?

from bs4 import BeautifulSoup
import urllib.request

# specify the url
urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
# find results within table
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)

Created by Gene Roddenberry
Original work Star Trek: The Original Series
Print publications
Book(s) List of reference books List of technical manuals
Novel(s) List of novels
Comics List of comics
Magazine(s) Star Trek: The Magazine Star Trek Magazine

2 Answers

User

#1 · edited 2024-04-26 22:07:30

With BeautifulSoup, you have to reformat the data yourself. Use fresult = [e.text for e in results] to get the text of each row.
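As a minimal sketch of that reformatting step, the code below pairs each row's header cell with its data cell. The helper name `infobox_rows` and the sample HTML are illustrative assumptions, not part of the question. It also assumes that matching the single class `infobox` (instead of the exact string `infobox vevent`) is enough to find the infobox on pages whose table carries different extra classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the HTML of a downloaded wiki page.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2">Star Trek</th></tr>
  <tr><th>Created by</th><td>Gene Roddenberry</td></tr>
  <tr><th>Original work</th><td>Star Trek: The Original Series</td></tr>
</table>
"""


def infobox_rows(html):
    """Return (label, value) pairs for rows that have both a <th> and a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" matches any table whose class list contains "infobox".
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        th = tr.find("th")
        td = tr.find("td")
        # Skip title or section rows that lack one of the two cells.
        if th is not None and td is not None:
            label = th.get_text(" ", strip=True)
            value = td.get_text(" ", strip=True)
            rows.append((label, value))
    return rows


for label, value in infobox_rows(SAMPLE_HTML):
    print(label, value)
```

To run it against a live page, the string returned by reading `urllib.request.urlopen(urlpage)` could be passed in place of `SAMPLE_HTML`.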

If you want to read a table from an HTML page, you can try something like the following code, although it uses pandas.

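import pandas

urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns a list of DataFrames, one per table found; [0] takes the first
data = pandas.read_html(urlpage)[0]
null = data.isnull()

# print the first two columns of each row, leaving the second blank if it is missing
for x in range(len(data)):
    first = data.iloc[x, 0]
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")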

User

#2 · edited 2024-04-26 22:07:30

This page will help you parse the HTML into a plain string without HTML tags: Using BeautifulSoup Extract Text without Tags

Here is the code from that page, credited to @0605002

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
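The blank lines at the top of that output come from whitespace in the markup. As a rough sketch, `stripped_strings` yields each text fragment with the surrounding whitespace removed, which makes it easy to pair labels with values. The shortened HTML and the assumption that labels and values strictly alternate are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup above, used as sample input.
html = """<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Each non-empty text fragment, stripped of leading and trailing whitespace.
parts = list(soup.stripped_strings)

# Assumes fragments alternate label, value, label, value, ...
pairs = dict(zip(parts[0::2], parts[1::2]))

for label, value in pairs.items():
    print(label, value)
```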
