Wget: how to recursively download a websi

wget -r https://stackoverflow.com
--2019-10-17 13:11:47-- https://stackoverflow.com/
Resolving stackoverflow.com (stackoverflow.com)... 151.101.65.69, 151.101.1.69, 151.101.129.69, ...
Connecting to stackoverflow.com (stackoverflow.com)|151.101.65.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115049 (112K) [text/html]
Saving to: ‘stackoverflow.com/index.html’
stackoverflow.com/index.html 100%[========================================================================================>] 112.35K 340KB/s in 0.3s
2019-10-17 13:11:48 (340 KB/s) - ‘stackoverflow.com/index.html’ saved [115049/115049]
Loading robots.txt; please ignore errors.
--2019-10-17 13:11:48-- https://stackoverflow.com/robots.txt
Reusing existing connection to stackoverflow.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 2094 (2.0K) [text/plain]
Saving to: ‘stackoverflow.com/robots.txt’
stackoverflow.com/robots.txt 100%[========================================================================================>] 2.04K --.-KB/s in 0s
2019-10-17 13:11:48 (5.23 MB/s) - ‘stackoverflow.com/robots.txt’ saved [2094/2094]
--2019-10-17 13:11:48-- https://stackoverflow.com/opensearch.xml
Reusing existing connection to stackoverflow.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 617 [text/xml]
Saving to: ‘stackoverflow.com/opensearch.xml’
stackoverflow.com/opensearch.xml 100%[========================================================================================>] 617 --.-KB/s in 0s
2019-10-17 13:11:49 (20.2 MB/s) - ‘stackoverflow.com/opensearch.xml’ saved [617/617]
--2019-10-17 13:11:49-- https://stackoverflow.com/feeds

1 answer

User

#1 · Posted 2024-04-29 01:02:26

Wget offers very few options for filtering.

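“--accept ACCLIST” and “--reject REJLIST” can be used to specify which file suffixes to accept or reject. This lets you restrict the file types that are kept, for example to exclude images, CSS, and so on.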
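
As a rough illustration of the suffix filter, the sketch below wraps a recursive download that rejects common image and stylesheet suffixes. The function name fetch_without_assets and the suffix list are my own assumptions for the example, not something from the question; adjust the list to whatever you want to skip.

```shell
# Hypothetical helper: recursive download that skips images and stylesheets.
# The suffix list is an illustrative assumption; --reject takes a
# comma-separated list of file name suffixes or patterns.
fetch_without_assets() {
    wget -r --reject 'jpg,jpeg,png,gif,svg,css' "$1"
}

# Usage (performs real network requests):
#   fetch_without_assets https://stackoverflow.com
```

Using “--accept” instead works the same way in reverse: only files matching the listed suffixes are kept.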

Wget cannot evaluate content against a search keyword (for example, “search”) without downloading that content first. Two options to consider:

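Download everything, then delete (or ignore) any file that does not match the search criteria.
Write a custom spider with a scripting tool (Python with Scrapy, Perl). This is more work, but it will give you exactly the functionality you need.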
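
The first option (download everything, then delete what does not match) could be sketched roughly as follows. This assumes the download already sits in a local directory and that a plain case-insensitive substring match is good enough; prune_unmatched is a hypothetical name chosen for this example.

```shell
# Hypothetical helper: delete every regular file under directory $1
# whose content does not contain the keyword $2.
# The match is a case-insensitive fixed string, not a regular expression.
# Note: a file that grep cannot read is treated as "no match" and removed too.
prune_unmatched() {
    find "$1" -type f \
        ! -exec grep -q -i -F -e "$2" {} \; \
        -exec rm -- {} \;
}

# Usage, after "wget -r https://stackoverflow.com" has finished:
#   prune_unmatched stackoverflow.com search
```

Since this deletes files, it may be worth replacing “rm --” with “echo” for a dry run first.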

Side note: consider using “--mirror” instead of “-r”. It also turns on timestamping, which helps reduce the time needed for repeat runs.
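
A minimal sketch of the mirror variant, wrapped in a hypothetical function mirror_site so the call is easy to repeat. According to the Wget manual, “--mirror” is currently equivalent to “-r -N -l inf --no-remove-listing”, so unlike plain “-r” it also removes the default recursion depth limit.

```shell
# Hypothetical helper: mirror a site with recursion and timestamping enabled.
# On a repeat run, timestamping (-N) lets wget skip files that are not
# newer on the server than the local copy.
mirror_site() {
    wget --mirror "$1"
}

# Usage (performs real network requests):
#   mirror_site https://stackoverflow.com
```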
