Beautiful Soup - scraping a Wiki page

Published 2024-03-28 11:08:26


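I am trying to scrape the list on the wiki page "https://en.wikipedia.org/wiki/Glossary_of_nautical_terms" to get the title and description of each nautical term. My first problem was correctly handling lists inside a description, which I did as follows: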

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Glossary_of_nautical_terms'
page = requests.get(url)

get_title = []
get_desc = []
corrected_desc = []
output = ''

if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    get_title = soup.find_all('dt', class_='glossary')
    get_desc = soup.find_all('dd', class_='glossary')

    for i in get_desc:
        first_char = i.get_text()[:1]
        second_char = i.get_text()[1:2]

        if (first_char.isnumeric() and second_char == '.'):
            if(first_char == '1' and output):
                corrected_desc.append(output)
                output = ''
                output += '{} '.format(i.get_text())
                continue
            else:
                output += '{} '.format(i.get_text())
                continue

        if output:
            corrected_desc.append(output)
            output = ''
            corrected_desc.append(i.get_text())
        else:
            corrected_desc.append(i.get_text())
else:
    print('failed to get the page!')


print(str(len(get_title)) + ' - ' + str(len(corrected_desc)))
zipped = zip(get_title, corrected_desc)

for j in zipped:
    output = '{}, {}\n'.format(j[0].get_text(), j[1].strip())
    with open('test.txt', "a", encoding='utf-8') as myfile:
        myfile.write(output)

But I can't figure out how to handle descriptions that contain both a list and a sentence.

Edit: the output I want is:

^{pr2}$

But I don't know how to adjust my code to handle the case where a description is a list plus a sentence.


1 answer
User
#1 · posted 2024-03-28 11:08:26

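All titles are inside <dt> tags, and the descriptions are inside <dd> tags. So the first step is to find all of these tags, which can be done with soup.find_all(['dt', 'dd']). Then loop over the tags and use if tag.name == 'dt' to check whether a tag is a dt or a dd. If the tag is a dd, append its content to the description variable (curr_description in the code). Otherwise the tag is a dt and a new entry begins, so print the title and description collected so far.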

Full code:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Glossary_of_nautical_terms')
soup = BeautifulSoup(r.text, 'lxml')

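curr_title, curr_description = '', ''
for tag in soup.find_all(['dt', 'dd']):
    if tag.name == 'dt':
        # A new title starts: print the entry collected so far.
        if curr_title:
            print('{}: {}'.format(curr_title, curr_description))
            curr_description = ''
        curr_title = tag.text.strip()
    else:
        # A <dd> extends the description of the current title.
        curr_description = ' '.join((curr_description, tag.text.strip())).strip()

# Print the last entry, which no following <dt> would otherwise flush.
if curr_title:
    print('{}: {}'.format(curr_title, curr_description))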

Partial output:

^{pr2}$
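
If the pairs are needed as data rather than printed (for example, to write them to a file as in the question), the same grouping idea can be pulled into a small function. This is only a sketch under stated assumptions: group_glossary is a hypothetical helper name, it takes plain (tag name, text) pairs so it can be tried without a network request, and the sample rows are made-up placeholders, not real text from the Wikipedia page.

```python
from typing import Iterable, List, Tuple


def group_glossary(tags: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each 'dt' title with the joined text of every 'dd' that follows it.

    Hypothetical helper: `tags` yields (tag_name, text) pairs in document order.
    """
    entries: List[Tuple[str, str]] = []
    title = None
    parts: List[str] = []
    for name, text in tags:
        if name == 'dt':
            # A new title closes the previous entry, if there was one.
            if title is not None:
                entries.append((title, ' '.join(parts)))
            title, parts = text.strip(), []
        elif name == 'dd' and title is not None:
            # Numbered items and plain sentences are treated the same way.
            parts.append(text.strip())
    # Flush the final entry, which has no following 'dt' to close it.
    if title is not None:
        entries.append((title, ' '.join(parts)))
    return entries


# Made-up placeholder rows standing in for the page's <dt>/<dd> tags.
sample = [
    ('dt', 'first term'),
    ('dd', '1. first sense'),
    ('dd', '2. second sense'),
    ('dd', 'closing sentence'),
    ('dt', 'second term'),
    ('dd', 'only sense'),
]
entries = group_glossary(sample)
for title, description in entries:
    print('{}, {}'.format(title, description))
```

With the soup object from the full code above, the input could be built as ((tag.name, tag.get_text()) for tag in soup.find_all(['dt', 'dd'])). Because every dd is appended to whichever title came last, a description made of a numbered list plus a sentence needs no special case.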
