How to extract text from HTML pages parsed in pandas

2 answers

User

Answer #1 · edited 2024-05-17 14:11:18

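The pandas API is designed for more primitive data types; you are better off writing a function that converts a link -> the text you need, and then calling apply. Here is one solution: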

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame({'link' : [
        'https://en.wikipedia.org/wiki/World%27s_funniest_joke',
        'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World'
    ]
})

def parse_link(mylink):
    # Download the page and parse it (the 'html5lib' parser must be installed)
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

def matching_paragraphs(soup, text):
    # Keep the text of every <p> element that contains `text`
    res = [p.get_text() for p in soup.find_all("p") if text in p.get_text()]
    return res

def apply_func(link, text):
    # link -> list of matching paragraph strings
    soup = parse_link(link)
    res = matching_paragraphs(soup, text=text)
    return res


df['text'] = df.link.apply(apply_func, args=("joke",))

Output:

                                                link                                               text
0  https://en.wikipedia.org/wiki/World%27s_funnie...  [The "world's funniest joke" is a term used by...
1  https://en.wikipedia.org/wiki/The_Funniest_Jok...  ["The Funniest Joke in the World" (also "Joke ...
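
To try the same idea without network access, here is a minimal sketch that runs the paragraph filter over inline HTML strings. The `pages` DataFrame, its `html` column, and the sample markup are hypothetical stand-ins for downloaded pages, and `html.parser` is swapped in for `html5lib` only to avoid the extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline pages standing in for downloaded HTML.
pages = pd.DataFrame({
    "html": [
        "<p>A joke about cats.</p><p>Nothing funny here.</p>",
        "<p>No match.</p><p>One joke.</p><p>Another joke.</p>",
    ]
})

def matching_paragraphs(html, text):
    """Return the text of every <p> element that contains `text`."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    return [p.get_text() for p in soup.find_all("p") if text in p.get_text()]

# Extra positional arguments are passed to the function through `args`.
pages["text"] = pages["html"].apply(matching_paragraphs, args=("joke",))
print(pages["text"].tolist())
```

Each cell of the new column holds a list of strings, one per matching paragraph, which is the same shape as the `text` column in the output above.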

With the DataFrame in this shape, you can more sensibly turn each list of strings into separate rows:

df.explode(column="text", ignore_index=True)

Result:

                                                 link                                               text
0   https://en.wikipedia.org/wiki/World%27s_funnie...  The "world's funniest joke" is a term used by ...
1   https://en.wikipedia.org/wiki/World%27s_funnie...  The winning joke, which was later found to be ...
2   https://en.wikipedia.org/wiki/World%27s_funnie...  Researchers also included five computer-genera...
3   https://en.wikipedia.org/wiki/The_Funniest_Jok...  "The Funniest Joke in the World" (also "Joke W...
4   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch appeared in the first episode of th...
5   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch is framed in a documentary style an...
6   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British Army are soon eager to determine "...
7   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The German version is described as being "over...
8   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The Germans attempt counter-jokes, but each at...
9   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British joke is said to have been laid to ...
10  https://en.wikipedia.org/wiki/The_Funniest_Jok...  The footage of Adolf Hitler is taken from Leni...
11  https://en.wikipedia.org/wiki/The_Funniest_Jok...  If the German version of the joke is entered i...
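
One detail worth knowing is how explode treats a page with no matching paragraphs. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up link and text values) to show that an empty list becomes a single NaN row, which can then be dropped if it is not wanted.

```python
import pandas as pd

# Hypothetical data: link "c" has no matching paragraphs.
toy = pd.DataFrame({
    "link": ["a", "b", "c"],
    "text": [["p1", "p2"], ["p3"], []],
})

# One row per list element; an empty list yields one row with NaN.
exploded = toy.explode(column="text", ignore_index=True)
print(exploded)

# Optionally drop the rows that came from empty lists.
cleaned = exploded.dropna(subset=["text"]).reset_index(drop=True)
print(cleaned)
```

Whether to keep or drop the NaN rows depends on whether links without matches should still appear in the result.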

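User

Answer #2 · edited 2024-05-17 14:11:18

Each entry of df[mytag] is a list of BeautifulSoup '<p>' elements. You can write a function that takes this list and returns the text of the paragraphs containing your word. Then use .apply over df[mytag] to run it on every row.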

def myfunc(list_of_ps, word='joke'):
    '''
    This will return a list of string text paragraphs
    containing the word, or None if no paragraph matches.
    '''
    result_ps = []
    for p in list_of_ps:
        if word in p.text:
            result_ps.append(p.text) # p if p itself is needed

    return result_ps if result_ps else None

df['mytag'].apply(myfunc)

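Edit:
The error in your question follows from the fact stated at the top of this answer: each entry is a list of BeautifulSoup <p> elements, not strings. re.search expects a string as its argument. In other words, x in that function call must be a string or a bytes-like object. In this case, it is a BeautifulSoup object representing a single <p> element. The error can be resolved by passing the element's text, x.text, instead.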
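
As a small illustration of that point, the sketch below reproduces the type problem on a hypothetical one-paragraph document: passing the element itself to re.search raises TypeError, while passing its .text works. The sample HTML and variable names are made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical single-paragraph document.
soup = BeautifulSoup("<p>A joke about cats.</p>", "html.parser")
p = soup.find("p")  # a BeautifulSoup element, not a string

try:
    re.search("joke", p)  # the element is not a str or bytes-like object
    raised = False
except TypeError:
    raised = True
print(raised)

# Passing the element's text gives re.search the string it expects.
match = re.search("joke", p.text)
print(match.group(0))
```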
