How to filter crawler traps out of a static corpus - Q&A - Python中文网

How to filter crawler traps out of a static corpus

Posted 2024-04-18 22:45:25

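I'm working on a homework assignment that asks us to write a program that crawls a given static corpus. My code prints every URL it crawled, but I know some of them are traps, and I can't think of a Pythonic way to filter them out.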

I used a regex to filter out URLs that look like traps, such as the ones below, but the assignment doesn't allow that because it counts as hard-coding.

https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96

http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094

https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094

http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d504a3676483838e82f07064ca3e12ee

There are many more with the same structure. There are also calendar URLs with a similar structure where only the date changes:

http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017

http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017

http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=26&month=01&year=2017

http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=27&month=01&year=2017

I want to filter these out of my results, but I can't think of a way to do it.

1 answer

User

#1 · Posted 2024-04-18 22:45:25

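I think this will solve your problem. The loop requests each URL and reports whether the response was successful: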

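    import requests

    # `urls` is assumed to be your list of crawled URLs; these two come from the question
    urls = [
        'https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96',
        'http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017',
    ]

    for url in urls:
        try:
            # A timeout keeps one unresponsive host from stalling the whole loop
            response = requests.get(url, timeout=10)
            # Raises requests.HTTPError for 4xx/5xx responses; otherwise does nothing
            response.raise_for_status()
        except requests.RequestException as err:
            print(f'Request failed for {url}: {err}')
        else:
            print(f'{url} is valid!')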
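
One caveat: a status check only shows that a URL responds, and trap URLs like the ones in the question may well respond successfully. As a possible alternative that avoids hard-coding any particular pattern, here is a minimal sketch that groups URLs by host, path, and the names of their query parameters (ignoring the scheme and the parameter values), then keeps only the first few URLs from each group. The names `url_signature`, `filter_traps`, and `max_per_signature` are hypothetical, and the threshold is an assumption you would need to tune for your corpus.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_signature(url):
    """Return (host, path, sorted query-parameter names) for a URL.

    The scheme and the parameter values are deliberately ignored, so URLs
    that differ only in a token or a date share one signature.
    """
    parts = urlsplit(url)
    # keep_blank_values=True so parameters like `category=` are still counted
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (parts.netloc.lower(), parts.path, tuple(sorted(names)))


def filter_traps(urls, max_per_signature=2):
    """Split `urls` into (kept, suspected_traps), preserving input order.

    `max_per_signature` is an assumed threshold: once this many URLs with
    the same signature have been kept, later ones are treated as suspected traps.
    """
    seen = defaultdict(int)
    kept, suspected = [], []
    for url in urls:
        signature = url_signature(url)
        seen[signature] += 1
        if seen[signature] <= max_per_signature:
            kept.append(url)
        else:
            suspected.append(url)
    return kept, suspected


if __name__ == "__main__":
    sample = [
        "https://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=4d26fc0839d47d4ec13c5461c1ed6d96",
        "http://cbcl.ics.uci.edu/doku.php/software/arem?do=login&sectok=d8b984cc6aa00bd1ef20471ac5150094",
        "http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=22&month=01&year=2017",
        "http://calendar.ics.uci.edu/calendar.php?type=day&calendar=1&category=&day=25&month=01&year=2017",
    ]
    kept, suspected = filter_traps(sample, max_per_signature=1)
    print("kept:", kept)
    print("suspected traps:", suspected)
```

This is a heuristic, not a definitive trap detector: a low threshold can also drop legitimate paginated pages, so it is worth inspecting the suspected list before discarding anything.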
