Web scraping with Python 3

from bs4 import BeautifulSoup
import urllib.request
import re

# Get the HTML for scraping
r = urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html').read()

# Create a BeautifulSoup object to parse the page
soup = BeautifulSoup(r, 'html.parser')

# A set removes duplicates
lst2 = set()
for link in soup.find_all('a'):
    lst2.add(link.get('href'))
lst2

Output:

{'#',
 '#content',
 '#uscb-nav-skip-header',
 '/',
 '/data/tables/time-series/demo/popest/pre-1980-county.html',
 '/data/tables/time-series/demo/popest/pre-1980-national.html',
 '/data/tables/time-series/demo/popest/pre-1980-state.html',
 '/en.html',
 '/library/publications/2010/demo/p25-1138.html',
 '/library/publications/2010/demo/p25-1139.html',
 '/library/publications/2015/demo/p25-1142.html',
 '/programs-surveys/popest/data.html',
 '/programs-surveys/popest/data/tables.html',
 '/programs-surveys/popest/geographies.html',
 '/programs-surveys/popest/guidance-geographies.html',
 None,
 'https://twitter.com/uscensusbureau',
 ...}

3 Answers

User

#1 · Edited 2024-04-20 02:24:24

You can loop over the set and filter each element with a regex. For None, you can simply check whether the value is None.
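As a minimal sketch of this approach, assuming the hrefs have already been collected into a set like lst2 from the question: the sample values below are copied from the question's output, while the pattern and the helper name filter_links are illustrative assumptions rather than anything the answer specifies.

```python
import re

# Sample values copied from the output shown in the question.
lst2 = {
    '#',
    '#content',
    '/',
    '/en.html',
    '/programs-surveys/popest/data.html',
    None,
    'https://twitter.com/uscensusbureau',
}

# Assumed pattern: keep site-relative paths longer than "/" and absolute http(s) URLs.
LINK_PATTERN = re.compile(r'^(/.+|https?://.+)$')


def filter_links(hrefs):
    """Return only the hrefs that are not None and match LINK_PATTERN."""
    kept = set()
    for href in hrefs:
        if href is None:  # <a> tags without an href attribute yield None
            continue
        if LINK_PATTERN.match(href):
            kept.add(href)
    return kept


filtered = filter_links(lst2)
print(sorted(filtered))
```

Adjust the pattern to whatever counts as a "real" link for your use case; the None check has to come first because re.match raises a TypeError on non-string input.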

User

#2 · Edited 2024-04-20 02:24:24

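The # character in a URL (and everything after it) matters to the browser but is not sent to the server when a web request is made, so that part can be stripped from the URL. This leaves URLs such as '#content' empty, and also turns '/about#contact' into '/about', which is what you actually want. From there, a single if statement is enough to add only non-empty strings to the set. Treating a missing href as an empty string filters out None at the same time:

lst2 = set()
for link in soup.find_all('a'):
    # link.get('href') returns None when the tag has no href attribute
    url = link.get('href') or ''
    # Drop the fragment ('#...') and keep only what precedes it
    url = url.split('#')[0]
    if url:
        lst2.add(url)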

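If you specifically want to exclude '/' (even though it is a valid URL), just call lst2.discard('/') at the end. Because lst2 is a set, this removes the element if it is present and does nothing if it is not.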
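To try this without making a network request, here is a small sketch that applies the same fragment-stripping logic to a hard-coded list of href values, then calls discard twice to show that the second call is a harmless no-op. The input list and the helper name strip_fragments are assumptions made for illustration.

```python
def strip_fragments(hrefs):
    """Drop the '#fragment' part of each href and keep the non-empty results."""
    cleaned = set()
    for href in hrefs:
        # Treat a missing href (None) as an empty string so split() is safe.
        url = (href or '').split('#')[0]
        if url:
            cleaned.add(url)
    return cleaned


# Hypothetical input resembling what link.get('href') returns for each <a> tag.
hrefs = ['#', '#content', '/', '/about#contact', '/en.html', None]

links = strip_fragments(hrefs)
links.discard('/')  # removes '/' because it is present
links.discard('/')  # no error even though '/' is already gone
print(sorted(links))
```

Using discard rather than remove is the relevant design choice here: set.remove raises a KeyError when the element is missing, while set.discard does not.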

User

#3 · Edited 2024-04-20 02:24:24

Try the following:

set(link.get('href') for link in soup.find_all('a') if link.has_attr('href'))
