I'm trying to download a website and its hyperlinks, and then list only the pages whose content contains the word "search". How can I do that?
I've already tried doing it recursively with wget -r --no-parent example.com,
but that also seems to download png, css, and xml files, which I don't think I need for the search.
wget -r https://stackoverflow.com
--2019-10-17 13:11:47-- https://stackoverflow.com/
Resolving stackoverflow.com (stackoverflow.com)... 151.101.65.69, 151.101.1.69, 151.101.129.69, ...
Connecting to stackoverflow.com (stackoverflow.com)|151.101.65.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115049 (112K) [text/html]
Saving to: ‘stackoverflow.com/index.html’
stackoverflow.com/index.html 100%[========================================================================================>] 112.35K 340KB/s in 0.3s
2019-10-17 13:11:48 (340 KB/s) - ‘stackoverflow.com/index.html’ saved [115049/115049]
Loading robots.txt; please ignore errors.
--2019-10-17 13:11:48-- https://stackoverflow.com/robots.txt
Reusing existing connection to stackoverflow.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 2094 (2.0K) [text/plain]
Saving to: ‘stackoverflow.com/robots.txt’
stackoverflow.com/robots.txt 100%[========================================================================================>] 2.04K --.-KB/s in 0s
2019-10-17 13:11:48 (5.23 MB/s) - ‘stackoverflow.com/robots.txt’ saved [2094/2094]
--2019-10-17 13:11:48-- https://stackoverflow.com/opensearch.xml
Reusing existing connection to stackoverflow.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 617 [text/xml]
Saving to: ‘stackoverflow.com/opensearch.xml’
stackoverflow.com/opensearch.xml 100%[========================================================================================>] 617 --.-KB/s in 0s
2019-10-17 13:11:49 (20.2 MB/s) - ‘stackoverflow.com/opensearch.xml’ saved [617/617]
--2019-10-17 13:11:49-- https://stackoverflow.com/feeds
Is there a better way to do this?
Thanks.
wget offers only a few options for filtering.
--accept ACCLIST and --reject REJLIST can be used to specify file suffixes to accept or reject. That can restrict the file types being downloaded, eliminating images, css, and so on.
Content cannot be evaluated for a keyword (such as "search") without downloading it first. Two options to consider:
Side note: consider using --mirror instead of -r. It also turns on timestamping, which helps cut down the time needed for re-runs.
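Putting the pieces together, a minimal sketch of the "download first, then search locally" approach (the domain and file names here are hypothetical; the wget step is shown as a comment because it needs network access, and the grep step is demonstrated on a small local tree instead):

```shell
# Step 1 (needs network; hypothetical domain): mirror only HTML pages,
# skipping images, css, etc. via the suffix accept-list.
#   wget --mirror --no-parent --accept html,htm https://example.com
#
# Step 2: list only the downloaded files whose content contains "search".
# Demonstrated here on a stand-in local tree with hypothetical file names.
mkdir -p example.com
printf '<html><body>a search page</body></html>\n' > example.com/hit.html
printf '<html><body>no keyword here</body></html>\n' > example.com/miss.html

# -r recurse, -l print matching file names only, -i case-insensitive;
# --include limits the scan to HTML files.
grep -rli 'search' example.com --include='*.html'
# prints: example.com/hit.html
```

The suffix filter only narrows what gets downloaded; the keyword filtering still has to happen locally with grep (or similar) after the mirror completes.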