Extracting the first paragraph from a Wikipedia article (Python)

User

Reply #1 · edited 2024-05-14 03:16:27

Here is what I did:

# Python 2 code; uses the legacy BeautifulSoup 3 package.
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

article = "Albert Einstein"
article = urllib.quote(article)  # percent-encode the title for use in a URL

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
# Print the first <p> element inside the main content div.
print soup.find('div', id="bodyContent").p
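
The snippet above targets Python 2 (urllib2 and the print statement). For readers on Python 3, here is a rough sketch of the same idea using only the standard library. It is not the original poster's code: the names FirstParagraphParser, first_paragraph, and fetch_article_html are made up for this example, and it swaps BeautifulSoup for html.parser. Unlike the original, it does not restrict the search to the bodyContent div; it simply returns the first <p> that contains any text, so it may pick up a stray paragraph on pages with unusual layouts.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_p = False
        self._parts = []
        self.paragraph = None  # set once a non-empty <p> has been closed

    def handle_starttag(self, tag, attrs):
        # Start collecting only while no paragraph has been found yet.
        if tag == "p" and self.paragraph is None:
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._parts).strip()
            if text:  # skip empty placeholder paragraphs
                self.paragraph = text

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def first_paragraph(html):
    """Return the text of the first non-empty <p> in html, or None."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.paragraph


def fetch_article_html(title):
    """Download the article page; the User-Agent mirrors the original answer."""
    url = "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

With network access, print(first_paragraph(fetch_article_html("Albert Einstein"))) should print the lead paragraph as plain text, including footnote markers such as [1] that appear in the page markup.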

User

Reply #2 · edited 2024-05-14 03:16:27

I wrote a Python library that aims to make this very simple. Check it out on GitHub.

To install it, run

$ pip install wikipedia

Then, to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

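As for how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. Specifically, by passing the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers parse the Wikitext and return a plain-text summary of the article you are requesting, up to and including the entire page text. It also accepts the parameters exchars and exsentences, which, not surprisingly, limit the number of characters and sentences returned by the API.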
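
To make the parameters above concrete, here is a minimal sketch of calling the API directly with the standard library instead of going through the wikipedia package. It is an illustration under assumptions, not the library's actual implementation: the helper names (build_extract_url, extract_from_response, fetch_summary) and the User-Agent string are invented for this example, and the explaintext and redirects parameters are additions not mentioned in the answer. As far as I know, current MediaWiki installs serve prop=extracts from the TextExtracts extension.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=None, chars=None):
    """Build a query URL asking for a plain-text extract of one article."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,            # assumption: request plain text, not HTML
        "exsectionformat": "plain",
        "redirects": 1,              # assumption: follow redirects to the article
        "titles": title,
    }
    if sentences is not None:
        params["exsentences"] = sentences  # limit the number of sentences
    if chars is not None:
        params["exchars"] = chars          # limit the number of characters
    return API_URL + "?" + urlencode(params)


def extract_from_response(payload):
    """Pull the extract out of the decoded JSON, or return None if absent."""
    # In the default JSON format, pages are keyed by page ID.
    for page in payload["query"]["pages"].values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_summary(title, sentences=2):
    """Fetch the first few sentences of an article (needs network access)."""
    request = Request(
        build_extract_url(title, sentences=sentences),
        headers={"User-Agent": "first-paragraph-example/0.1"},  # hypothetical
    )
    with urlopen(request) as response:
        return extract_from_response(json.load(response))
```

Calling fetch_summary("Albert Einstein", sentences=2) should return roughly the same text as the wikipedia.summary call above, although the exact sentence splitting is decided by the server.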

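User

Reply #3 · edited 2024-05-14 03:16:27

A while ago I wrote two classes for fetching Wikipedia articles as plain text. I know this is not the best solution, but you can adapt it to your needs: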

wikipedia.py
wiki2plain.py

You can use it like this:

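from wikipedia import Wikipedia  # the wikipedia.py file listed above
from wiki2plain import Wiki2Plain

lang = 'simple'  # language edition to query (Simple English)
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:
    # Treat any failure to fetch the article as "no content".
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text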
