从维基百科艺术中获取所有标题和普通文本

2024-05-28 19:57:58 发布

您现在位置:Python中文网/ 问答频道 /正文

在Python中,我如何获取wikipedia文章中的所有标题和平面文本,比如:https://en.wikipedia.org/wiki/Amadeus_(film)。我现在的代码是:

    from bs4 import BeautifulSoup


# ---- Definitions ----#
#Amount of documents
amount_of_documents = 1

#Directory of raw HTML documents
directory_of_raw_documents = "raw_documents/"

#Directory of parsed documents
directory_of_parsed_documents = "parsed_documents/"

# ---- Code ----#


def open_document():
    for i in range (1, 1+1):
        with open(directory_of_raw_documents + str(i), "r") as document:
            html = document.read()
            soup = BeautifulSoup(html, "html.parser")
            body = soup.find('div', id='bodyContent')
            for elements in body.find_all('p'):
                print(elements.text)

open_document()

我正在加载一个下载的HTML文件,然后使用BeautifulSoup获取<p>标记之间的所有内容。我的目标是获取本文的所有标题和纯文本内容。我该怎么做呢?在

在上面的示例中,我想要的输出将包含:

  1. 所有片名(电影、剧情、演员、接待等)
  2. 此页中的所有文本(在<p>标记之间)
  3. 忽略引用

Tags: of文本标题forrawhtmlopenwikipedia
1条回答
网友
1楼 · 发布于 2024-05-28 19:57:58

您可能会对使用专门的wikipedia页面解析器感兴趣,比如^{} package。这样您就可以很容易地获得内容:

In [1]: import wikipedia

In [2]: page = wikipedia.page("Amadeus (film)")

In [3]: page.summary
Out[3]: u"Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer's stage play Amadeus (1979). The story, set in Vienna, Austria, during the latter half of the 18th century, is a fictionalized biography of Wolfgang Amadeus Mozart. Mozart's music is heard extensively in the soundtrack of the movie. Its central thesis is that Antonio Salieri, an Italian contemporary of Mozart is so driven by jealousy of the latter and his success as a composer that he plans to kill him and to pass off a Requiem, which he secretly commissioned from Mozart as his own, to be premiered at Mozart's funeral. Historically, the Requiem which was never finished was commissioned by Count von Walsegg and Salieri, far from being jealous of Mozart, was on good terms with him and even tutored his son after Mozart's death.\nThe film was nominated for 53 awards and received 40, which included eight Academy Awards (including Best Picture), four BAFTA Awards, four Golden Globes, and a Directors Guild of America (DGA) award. As of 2016, it is the most recent film to have more than one nomination in the Academy Award for Best Actor category. In 1998, the American Film Institute ranked Amadeus 53rd on its 100 Years... 100 Movies list."

In [4]: page.content
Out[4]: u'Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer\'s s
...
Amadeus Filming locations at Movieloci.com'

关于获取标题,下面是通过BeautifulSoup获取它们的示例代码:

^{pr2}$

h2 .mw-headline是一个CSS selector,它将与h2父元素下的mw-headline类相匹配。在

相关问题 更多 >

    热门问题