How to extract the text of every paragraph on a web page that contains a specific string

Posted 2024-04-19 09:35:31


I have a question about this site. I want to extract the proverbs in my local language, together with their meanings, in table form.

import requests
from bs4 import BeautifulSoup

res2 = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup2 = BeautifulSoup(res2.content, 'html.parser')

Yoruba = []
English = []
# Each proverb is the first <li> of an <ol>
for ol in soup2.find_all('ol'):
    proverb = ol.find('li')
    Yoruba.append(proverb.text)

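I managed to extract the proverbs in my local language into a list. I also want to extract every sentence that starts with the string Meaning: into another list, for example: ["Your status in life dictates your attitude towards your peers", "Behave in a mature manner to avoid a bad reputation."]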


2 Answers

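This script extracts the proverbs, their translations, and their meanings, and builds a DataFrame from them. The list of meanings is in data['Meaning']: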

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup = BeautifulSoup(res.content,'html.parser')

data = {'Yoruba':[], 'Translation':[], 'Meaning':[]}
# Pair each <ol> with the <p> right after it and the <p> after that one
for yoruba, translation, meaning in zip(soup.select('ol'), soup.select('ol + p'), soup.select('ol + p + p')):
    data['Yoruba'].append(yoruba.get_text(strip=True))
    data['Translation'].append(re.sub(r'Translation:\s*', '', translation.get_text(strip=True)))
    data['Meaning'].append(re.sub(r'Meaning:\s*', '', meaning.get_text(strip=True)))

# print(data['Meaning']) # <  your meanings list

df = pd.DataFrame(data)
print(df)

Prints:

                                               Yoruba                                        Translation                                            Meaning
0                         Ile oba t'o jo, ewa lo busi  When a king's palace burns down, the re-built ...  Necessity is mother of invention, creativity i...
1   Gbogbo alangba lo d'anu dele, a ko mo eyi t'in...  All lizards lie flat on their stomach and it i...  Everyone looks the same on the outside but eve...
2                           Ile la ti n ko eso re ode                             Charity begins at Home  A man cannot give what he does not have good o...
3                        A pę ko to jęun, ki ję ibaję  The person that eat late, will not eat spoiled...  It is more profitable to exercise patience whi...
4        Eewu bę loko Longę, Longę fun ara rę eewu ni  There is danger at Longę's farm (Longę is a na...  You should be extremely careful of situations ...
5   Bi Ēēgun nla ba ni ohùn o ri gontò, gontò na a...  If a big masquerade claims it doesn't see the ...  If an important man does not respect those les...
6   Kò sí ęni tí ó ma gùn ęşin tí kò ní ju ìpàkó. ...  No one rides a horse without moving his head, ...  Your status in life dictates your attitude tow...
7               Bí abá so òkò sójà ará ilé eni ní bá;  He who throws a stone in the market will hit h...  Be careful what you do unto others it may retu...
8             Agba ki wa loja, ki ori omo titun o wo.     Do not go crazy, do not let the new baby look.  Behave in a mature manner so avoid bad reputat...
9                      Adìẹ funfun kò mọ ara rẹ̀lágbà         The white chicken does not realize its age                                   Respect yourself
10                           Ọbẹ̀ kìí gbé inú àgbà mì   The soup does not move round in an elder’s belly                 You should be able to keep secrets

... and so on
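
The `zip` over three selectors above lines up correctly only if every `ol` is followed by exactly one Translation paragraph and one Meaning paragraph. If an entry were missing its Meaning, the later meanings would shift up by one row. A minimal sketch of a more defensive variant follows: it walks the siblings after each `ol` and stops at the next `ol`. The sample HTML is a hypothetical stand-in for the assumed page layout (the second entry deliberately has no Meaning paragraph), and `collect_rows` is an illustrative name, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample mimicking the assumed layout: <ol>, then sibling <p> tags.
# The second proverb has no "Meaning:" paragraph, to show how gaps are handled.
SAMPLE_HTML = """
<div>
<ol><li>Adìẹ funfun kò mọ ara rẹ̀lágbà</li></ol>
<p>Translation: The white chicken does not realize its age</p>
<p>Meaning: Respect yourself</p>
<ol><li>Ile la ti n ko eso re ode</li></ol>
<p>Translation: Charity begins at Home</p>
<ol><li>Ọbẹ̀ kìí gbé inú àgbà mì</li></ol>
<p>Translation: The soup does not move round in an elder’s belly</p>
<p>Meaning: You should be able to keep secrets</p>
</div>
"""

LABELS = ("Translation", "Meaning")


def collect_rows(html):
    """Return one dict per <ol>, filled from the <p> siblings that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for ol in soup.find_all("ol"):
        row = {"Yoruba": ol.get_text(strip=True), "Translation": None, "Meaning": None}
        for sibling in ol.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other text nodes between tags
            if sibling.name == "ol":
                break  # the next proverb starts here
            if sibling.name != "p":
                continue
            text = sibling.get_text(strip=True)
            for label in LABELS:
                prefix = label + ":"
                if text.startswith(prefix):
                    row[label] = text[len(prefix):].strip()
        rows.append(row)
    return rows


rows = collect_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

A missing field stays `None` instead of borrowing the next entry's text, and the resulting list of dicts can be passed to `pd.DataFrame(rows)` if a table is wanted.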

Just search all the paragraphs and check whether each paragraph's text starts with "Meaning:".

Try this:

import requests
from bs4 import BeautifulSoup

res2 = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup2 = BeautifulSoup(res2.content, 'html.parser')

yoruba = []
english = []
# Each proverb is the first <li> of an <ol>
for ol in soup2.find_all('ol'):
    proverb = ol.find('li')
    yoruba.append(proverb.text)

# Keep only the paragraphs whose text starts with "Meaning:"
for paragraph in soup2.find_all('p'):
    if paragraph.text.startswith("Meaning:"):
        english.append(paragraph.text)

english = [x.replace("Meaning: ", "") for x in english]
print(english)

Prints:

[' Necessity is mother of invention, creativity is often achieved after overcoming many difficulties.',
 ' Everyone looks the same on the outside but everyone has problems that are invisible to outsiders.',
...
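
The leading space in each printed string suggests that the text after "Meaning:" starts with more whitespace than the single space that `replace("Meaning: ", "")` removes. A minimal sketch of a whitespace-tolerant variant follows, using only the standard library. The `extract_meanings` helper and the sample strings are hypothetical, and the non-breaking space in the sample is an assumption about what the page might contain.

```python
import re

# Matches "Meaning:" at the start, plus any whitespace around it.
# For str patterns, \s also matches the non-breaking space (\xa0).
MEANING_PREFIX = re.compile(r"^\s*Meaning:\s*")


def extract_meanings(paragraph_texts):
    """Return the text after 'Meaning:' for each paragraph that has that prefix."""
    meanings = []
    for text in paragraph_texts:
        match = MEANING_PREFIX.match(text)
        if match:
            # Drop the prefix, then trim any trailing whitespace.
            meanings.append(text[match.end():].strip())
    return meanings


# Hypothetical paragraph texts standing in for p.get_text() results.
sample = [
    "Translation: Charity begins at Home",
    "Meaning:  Respect yourself",
    "Meaning:\xa0You should be able to keep secrets ",
    "Some unrelated paragraph",
]
meanings = extract_meanings(sample)
print(meanings)
```

With the scraped page, the input would be something like `[p.get_text() for p in soup2.find_all('p')]`.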
