Extracting the first paragraph from a Wikipedia article (Python)

Posted 2024-05-14 03:16:27


How can I extract the first paragraph of a Wikipedia article using Python?

For example, for Albert Einstein, that would be:

Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".[3]


3 Answers

Here is what I did (Python 2, with BeautifulSoup 3):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

article= "Albert Einstein"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.find('div',id="bodyContent").p
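
For readers on Python 3, where urllib2 and the print statement no longer exist, the same idea can be sketched with the standard library alone. This is only a rough equivalent under stated assumptions: the names FirstParagraphParser, first_paragraph and fetch_article_html are made up for illustration, the User-Agent string is a placeholder, and the parser takes the first non-empty <p> anywhere in the page instead of looking inside the bodyContent div as the snippet above does.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_p = False      # True while inside a candidate <p>
        self._chunks = []       # text fragments of the current <p>
        self.paragraph = None   # set once a non-empty paragraph has closed

    def handle_starttag(self, tag, attrs):
        # Start collecting only while no paragraph has been found yet.
        if tag == "p" and self.paragraph is None:
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._chunks).strip()
            if text:            # skip empty placeholder paragraphs
                self.paragraph = text
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def first_paragraph(html):
    """Return the plain text of the first non-empty paragraph, or None."""
    parser = FirstParagraphParser()
    parser.feed(html)
    return parser.paragraph


def fetch_article_html(title, lang="en"):
    """Download the article page; the User-Agent value is a placeholder."""
    url = "https://%s.wikipedia.org/wiki/%s" % (lang, quote(title.replace(" ", "_")))
    request = Request(url, headers={"User-Agent": "first-paragraph-example/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(first_paragraph(fetch_article_html("Albert Einstein")))
```

Footnote markers such as [2] stay in the extracted text, and the real page layout may put other paragraphs first, so treat the result as a starting point.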

I wrote a Python library that aims to make this very easy. Check it out on GitHub.

To install it, run

$ pip install wikipedia

Then, to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

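As for how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. Specifically, when you pass the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers parse the Wikitext and return a plain-text summary of the article you requested, up to and including the entire page text. The API also accepts the parameters exchars and exsentences, which, not surprisingly, limit the number of characters and sentences it returns.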
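
To make the request described above concrete, here is a minimal Python 3 sketch that calls the API directly without the library. It is not the library's actual implementation. The parameters prop=extracts, exsectionformat=plain and exsentences come from the answer; action, format, explaintext and titles, the endpoint URL, the assumed response layout (query, then pages, then extract) and the helper names build_extract_url, extract_from_response and fetch_summary are my assumptions, and the User-Agent string is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint of the English Wikipedia API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=2):
    """Build a query URL asking for a plain-text extract of one article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exsectionformat": "plain",
        "explaintext": 1,          # assumed: ask for plain text, not HTML
        "exsentences": sentences,  # limit the number of sentences returned
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def extract_from_response(payload):
    """Pull the extract out of a decoded JSON response, or return None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_summary(title, sentences=2):
    """Perform the request; the User-Agent value is a placeholder."""
    request = Request(build_extract_url(title, sentences),
                      headers={"User-Agent": "extract-example/0.1"})
    with urlopen(request) as response:
        return extract_from_response(json.load(response))


if __name__ == "__main__":
    print(fetch_summary("Albert Einstein", sentences=2))
```

Keeping URL construction and response parsing in separate functions lets both be checked without network access.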

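Some time ago I wrote two classes for getting Wikipedia articles as plain text. I know they are not the best solution, but you can adapt them to your needs: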

wikipedia.py
wiki2plain.py

You can use them like this:

from wikipedia import Wikipedia
from wiki2plain import Wiki2Plain

lang = 'simple'
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:
    # Treat any failure to fetch the article as "no content".
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text
