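I'm trying to scrape the lists on the wiki page "https://en.wikipedia.org/wiki/Glossary_of_nautical_terms" to get the title and description of each nautical term. My first problem is correctly handling lists inside the descriptions. Here is what I have so far: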
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Glossary_of_nautical_terms'
page = requests.get(url)
get_title = []
get_desc = []
corrected_desc = []
output = ''
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    get_title = soup.find_all('dt', class_='glossary')
    get_desc = soup.find_all('dd', class_='glossary')
    for i in get_desc:
        first_char = i.get_text()[:1]
        second_char = i.get_text()[1:2]
        if (first_char.isnumeric() and second_char == '.'):
            if (first_char == '1' and output):
                corrected_desc.append(output)
                output = ''
                output += '{} '.format(i.get_text())
                continue
            else:
                output += '{} '.format(i.get_text())
                continue
        if output:
            corrected_desc.append(output)
            output = ''
            corrected_desc.append(i.get_text())
        else:
            corrected_desc.append(i.get_text())
else:
    print('failed to get the page!')
print(str(len(get_title)) + ' - ' + str(len(corrected_desc)))
zipped = zip(get_title, corrected_desc)
for j in zipped:
    output = '{}, {}\n'.format(j[0].get_text(), j[1].strip())
    with open('test.txt', "a", encoding='utf-8') as myfile:
        myfile.write(output)
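But I can't figure out how to handle descriptions that contain both a list and a sentence.
Edit: I have a specific output format in mind (the sample did not survive in this copy), but I don't know how to adjust my code for the case where a description is a list plus a sentence.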
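All the titles are inside <dt> tags, and the descriptions are inside <dd> tags. So the first step is to find all of these tags, which can be done with soup.find_all(['dt', 'dd']). Then loop over the tags and use if tag.name == 'dt' to check whether each tag is a dt or a dd. If the tag is a dd, append its content to the description variable; otherwise print the current contents of the variables. Complete code: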
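The answer's original code is not reproduced here. What follows is only a minimal sketch of the steps it describes, under some assumptions: SAMPLE_HTML is a hand-written fragment with placeholder terms (not text from the real page), group_terms is a hypothetical helper name, and entries are collected into a list rather than printed right away so that the last term is not lost.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the glossary markup (an assumption; the real
# page is much larger and its exact structure may differ).
SAMPLE_HTML = """
<dl>
  <dt class="glossary">term-a</dt>
  <dd class="glossary">A single sentence.</dd>
  <dt class="glossary">term-b</dt>
  <dd class="glossary">1. First sense.</dd>
  <dd class="glossary">2. Second sense.</dd>
  <dd class="glossary">A trailing sentence.</dd>
</dl>
"""


def group_terms(html):
    """Return [(title, description), ...], joining every <dd> after a <dt>."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    title = None
    parts = []
    # Passing a list of names yields the matching tags in document order.
    for tag in soup.find_all(["dt", "dd"]):
        if tag.name == "dt":
            # A new term starts: flush the previous one first.
            if title is not None:
                entries.append((title, " ".join(parts)))
            title = tag.get_text(strip=True)
            parts = []
        elif title is not None:
            # A <dd>: numbered item or plain sentence, both are kept.
            parts.append(tag.get_text(" ", strip=True))
    # Flush the final term, which has no following <dt> to trigger it.
    if title is not None:
        entries.append((title, " ".join(parts)))
    return entries


if __name__ == "__main__":
    for title, description in group_terms(SAMPLE_HTML):
        print("{}, {}".format(title, description))
```

Because every <dd> is attached to the most recent <dt>, there is no need to inspect the first characters for "1." or to zip two separately collected lists, so a description made of a numbered list plus a sentence stays with its term. To run this against the live page, page.text from the question's requests.get(url) call could be passed to group_terms in place of SAMPLE_HTML.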
Partial output: (not preserved in this copy)