How to extract text from HTML pages parsed in pandas

Posted 2024-05-17 14:11:18


Consider this simple example:

df = pd.DataFrame({'link' : ['https://en.wikipedia.org/wiki/World%27s_funniest_joke',
                             'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World']})

df
Out[169]: 
                                                           link
0         https://en.wikipedia.org/wiki/World%27s_funniest_joke
1  https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World

I want to parse each link with Beautiful Soup and store the parsed content in another column of the DataFrame. The following seems to work:

def puller(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

# apply on the 'link' column; df.apply would pass whole columns to puller
df['parsed'] = df['link'].apply(puller)
df['mytag'] = df.parsed.apply(lambda x: x.find_all('p'))

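The problem is that I get lists, and I need to work with the text inside them. In particular, I am trying to keep only the paragraphs that mention "joke", but I cannot get it to work: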

def extractor(mylist):
    return list(filter(lambda x: re.search('joke', x), mylist))

df.mytag.apply(lambda x: extractor(x))
TypeError: expected string or bytes-like object

What is the best approach here?

Thanks!


2 answers

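The pandas API is designed for more primitive data types; you are better off writing a function that converts a link -> the text you need, and then calling apply. Here is one solution: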

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame({'link' : [
        'https://en.wikipedia.org/wiki/World%27s_funniest_joke',
        'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World'
    ]
})

def parse_link(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

def matching_paragraphs(soup, text):
    res = [p.get_text() for p in soup.find_all("p") if text in p.get_text()]
    return res
   
def apply_func(link, text):
    soup = parse_link(link)
    res = matching_paragraphs(soup, text=text)
    return res
    

df['text'] = df.link.apply(apply_func, args=("joke",))

Output:

                                                link                                               text
0  https://en.wikipedia.org/wiki/World%27s_funnie...  [The "world's funniest joke" is a term used by...
1  https://en.wikipedia.org/wiki/The_Funniest_Jok...  ["The Funniest Joke in the World" (also "Joke ...

With the DataFrame in this shape, you can then sensibly turn each list of strings into separate rows:

df.explode(column="text", ignore_index=True)

Result:

                                                 link                                               text
0   https://en.wikipedia.org/wiki/World%27s_funnie...  The "world's funniest joke" is a term used by ...
1   https://en.wikipedia.org/wiki/World%27s_funnie...  The winning joke, which was later found to be ...
2   https://en.wikipedia.org/wiki/World%27s_funnie...  Researchers also included five computer-genera...
3   https://en.wikipedia.org/wiki/The_Funniest_Jok...  "The Funniest Joke in the World" (also "Joke W...
4   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch appeared in the first episode of th...
5   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch is framed in a documentary style an...
6   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British Army are soon eager to determine "...
7   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The German version is described as being "over...
8   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The Germans attempt counter-jokes, but each at...
9   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British joke is said to have been laid to ...
10  https://en.wikipedia.org/wiki/The_Funniest_Jok...  The footage of Adolf Hitler is taken from Leni...
11  https://en.wikipedia.org/wiki/The_Funniest_Jok...  If the German version of the joke is entered i...
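
To see what explode does without fetching any pages, here is a minimal sketch on made-up data. The frame `toy` and its link names are hypothetical and not part of the answer above. It also shows that a row whose list is empty ends up as NaN, which you may want to drop.

```python
import pandas as pd

# Made-up data: one row per link, each holding a list of paragraph strings.
toy = pd.DataFrame({
    "link": ["page_a", "page_b", "page_c"],
    "text": [["first joke", "second joke"], ["only joke"], []],
})

# One output row per list element; the link value is repeated.
exploded = toy.explode(column="text", ignore_index=True)

# A row whose list was empty becomes NaN; drop it if it is not wanted.
cleaned = exploded.dropna(subset=["text"]).reset_index(drop=True)
print(cleaned)
```

Whether to drop the NaN rows depends on whether you need to keep track of links with no matching paragraphs.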

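Each entry of df['mytag'] is a list of BeautifulSoup <p> elements. You can write a function that takes this list and returns the text of the paragraphs containing your word, then use .apply on df['mytag'] to run it over all rows: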

def myfunc(list_of_ps, word='joke'):
    '''
    This will return a list of string text paragraphs 
    containing the word.
    '''
    result_ps = []
    for p in list_of_ps:
        if word in p.text:
            result_ps.append(p.text) # p if p itself is needed

    return result_ps if result_ps else None

df['mytag'].apply(myfunc)

Edit:
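The error in your question reflects the fact noted above: each entry of df['mytag'] is a list of BeautifulSoup <p> elements, not strings. re.search expects a string as its argument; in other words, x in that function call must be a string or a bytes-like object. Here it is a BeautifulSoup object representing a single <p> element. The error can be resolved by taking the element's text as a string with x.text.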
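
As a minimal sketch of that fix, the asker's extractor can hand each element's text to re.search rather than the element itself. The HTML snippet below is made up for illustration, and the built-in `html.parser` replaces `html5lib` only so the example runs without a network request. The `pattern` parameter and the case-insensitive flag are additions of this sketch, not part of the original question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<p>The winning joke was submitted by a psychiatrist.</p>
<p>This paragraph is about something else.</p>
<p>A <b>Joke</b> can also be matched case-insensitively.</p>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # a list of Tag objects, not strings

def extractor(mylist, pattern="joke"):
    # re.search needs a string, so match against each element's text.
    return [p.get_text() for p in mylist
            if re.search(pattern, p.get_text(), flags=re.IGNORECASE)]

matches = extractor(paragraphs)
print(matches)
```

With the columns from the question, the same function could be applied as df['mytag'].apply(extractor).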
