Python scraping: removing duplicates

    import requests
    from bs4 import BeautifulSoup as soup

    def get_emails(_links:list):
        for i in range(len(_links)):
            new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class':'my_modal_open'})
            if new_d:
                yield new_d[-1]['title']

    start = 20
    while True:
        d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')
        results = [i['href'] for i in d.find_all('a')][52:-9]
        results = [link for link in results if link.startswith('http://')]
        next_page=d.find('div', {'class': 'paging'}, 'weiter')
        if next_page:
            start+=20
        else:
            break
        allLinks= set()
        if results not in allLinks:
            print(list(get_emails(results)))
            allLinks.add(results)

2 Answers

User

#1 · Edited 2024-04-25 05:44:23

I got it to run, but I am still getting duplicate emails.

    allLinks = []
    if results not in allLinks:
        print(list(get_emails(results)))
        allLinks.append(results)

Does anyone know why?
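One possible reason, sketched here under assumptions: results is the list of links for a whole page, so results not in allLinks compares entire page lists and never looks at individual entries. Two pages that share a link are still two different lists. The page contents below are made up for illustration and are not taken from the site.

```python
# Hypothetical page results; the URLs are invented for this sketch.
page_1 = ["http://a.example/1", "http://a.example/2"]
page_2 = ["http://a.example/2", "http://a.example/3"]

seen_pages = []
for results in (page_1, page_2):
    # This membership test compares the whole list with each stored list,
    # so it only filters out a page that is identical to an earlier page.
    if results not in seen_pages:
        seen_pages.append(results)

# Both pages were stored, although they share one link.
flattened = [link for page in seen_pages for link in page]
duplicate_count = len(flattened) - len(set(flattened))
print(duplicate_count)
```

If allLinks is also re-created on every pass through the loop, the check can never find anything, whatever it stores.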

User

#2 · Edited 2024-04-25 05:44:23

You are trying to add the entire list of emails to the set as a single entry.

What you need is to add each actual email to the set as its own entry.

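The problem is this line:

    allLinks.add(results)

It tries to add the whole results list to the set as a single element, which does not work: a list is unhashable, so set.add raises a TypeError. Use this instead: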

    allLinks.update(results)

This updates the set with the elements of the list, so each element becomes a separate entry in the set.
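A minimal sketch of the difference, and of one way to drop emails that were already seen on an earlier page. The helper unique_in_order and the example addresses are hypothetical and not part of the original code. In the scraper, the lists would presumably come from get_emails(results).

```python
def unique_in_order(items, seen=None):
    """Yield each item the first time it appears, recording it in `seen`."""
    if seen is None:
        seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# set.add needs a hashable element, so passing a list raises TypeError.
links = set()
try:
    links.add(["a", "b"])
    add_error = None
except TypeError as exc:
    add_error = exc

# set.update iterates over the list and adds every element on its own.
links.update(["a", "b", "a"])

# Made-up emails standing in for two scraped pages.
emails_page_1 = ["info@one.example", "office@two.example", "info@one.example"]
emails_page_2 = ["office@two.example", "mail@three.example"]

# One set shared across pages, created once before the loop.
seen_emails = set()
new_1 = list(unique_in_order(emails_page_1, seen_emails))
new_2 = list(unique_in_order(emails_page_2, seen_emails))
print(new_1)
print(new_2)
```

The shared set is created once and passed in on every call. Creating it inside the loop would forget everything seen on earlier pages.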
