Downloading all PDF files from a website that does not support wildcards

Asked 2025-04-12 04:02

I want to download all the PDF files from the page "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml". I have tried many approaches, including this wget command:

wget --wait 10 --random-wait --continue https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf

But I get this output:

Warning: wildcards not supported in HTTP.
--2024-03-29 23:01:27-- https://journals.ametsoc.org/downloadpdf/view/journals/mwre/131/5/1520-0493_2003_131_*.co_2.pdf
Resolving journals.ametsoc.org (journals.ametsoc.org)... 54.73.220.207, 52.208.161.60
Connecting to journals.ametsoc.org (journals.ametsoc.org)|54.73.220.207|:443... connected.
HTTP request sent, awaiting response... 500
2024-03-29 23:01:28 ERROR 500: (no description).

Is there a way to do this with wget, Python, or some other tool? Thanks!

2 Answers

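In a simple case you do not need Python; the tools that ship with the operating system are enough.

The Unix philosophy (DOS works the same way, though Windows CMD.exe does it better) is to write reusable command blocks that adapt to a specific case. You write a set of commands to meet your goal, so some parts of the code need to be specific while the rest can stay generic.

So all we need is an HTML "fetch and edit" capability that can be written once and used many times (WORM).

Here I stopped at the first level of runget, which gives the link to each PDF, but it can be used in a second stage to fetch all of those files. For example, Pass1.htm lets you download selected files manually, one at a time. You can skip that step by simply leaving out that call.

GET.CMD (can be called from any other .BAT file)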

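@echo off
rem Usage 1: get STRING URL            fetch URL and keep the lines containing STRING
rem Usage 2: get IN FIND REPLACE OUT   regex find-and-replace from file IN into file OUT
rem Usage 3: get file$ LIST            download every URL listed in file LIST
if [%2]==[] goto usage
if /i [%1]==[file$] goto getfile$
if not [%4]==[] goto editlines
:getdata
rem Save the page, then filter the matching lines into listurls.htm
curl -o scrape.txt "%~2"
type scrape.txt |find "%~1" >listurls.htm & exit /b
:editlines
rem PowerShell -replace treats the FIND argument as a regular expression
powershell -Command "(gc '%~1') -replace '%~2', '%~3' | sc '%~4'"
exit /b
:getfile$
rem Lines starting with a semicolon are skipped
for /F "eol=;" %%f in (%~2) do curl -O %%f
pause & exit /b
:usage
echo %~n0 string URL
echo e.g. %~n0 ".pdf" https://example.com/file.htm
pause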

Phase1.bat

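rem Keep only the lines of the issue page that mention "2.xml" (written to listurls.htm)
call get "2.xml" https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml
type listurls.htm & pause
rem Turn each /abstract link ending in .xml into a downloadpdf URL ending in .pdf
call get listurls.htm "/abstract" "https://journals.ametsoc.org/downloadpdf/view" pass1.txt
call get pass1.txt ".xml" ".pdf" pass2.txt
rem Add link text so that pass1.htm is clickable in a browser
call get pass2.txt ">" ">a pdf</a></br>" pass1.htm
rem Review in the browser, then edit by hand in Notepad
pass1.htm & pause
notepad pass1.htm
del pass?.txt
rem Download every URL listed in filelist.txt, which is prepared by hand from pass1.htm
call get file$ filelist.txt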
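
For readers who would rather see the find-and-replace passes in one place, here is a minimal Python sketch of the same rewriting that Phase1.bat does through `get`. The helper name `to_pdf_url` and the sample link are hypothetical. The sketch assumes, as the batch file does, that each article link on the issue page starts with `/abstract` and ends in `.xml`, and it works on a bare link rather than on a whole HTML line.

```python
BASE = "https://journals.ametsoc.org/downloadpdf/view"


def to_pdf_url(href: str) -> str:
    """Rewrite an article link the way Phase1.bat does (hypothetical helper)."""
    if not href.startswith("/abstract") or not href.endswith(".xml"):
        raise ValueError(f"unexpected link format: {href}")
    # Pass 1: swap the /abstract prefix for the download endpoint.
    path = BASE + href[len("/abstract"):]
    # Pass 2: swap the .xml suffix for .pdf.
    return path[: -len(".xml")] + ".pdf"


if __name__ == "__main__":
    # Made-up link for illustration only; real links come from listurls.htm.
    sample = "/abstract/journals/mwre/131/5/example_co_2.xml"
    print(to_pdf_url(sample))
```

Unlike the PowerShell `-replace` passes, this version only touches the start and end of the link, so a `.xml` appearing elsewhere in the path would be left alone.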

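In the second stage we need to keep applying find-and-replace to the output to turn pass1.htm into filelist.txt, and then run the for ... do curl -O loop over filelist.txt.

You can do that in any text editor, such as Notepad (called above), because for a single one-off case, editing locally is much quicker than writing another six lines of code. The advantage is that you can exclude some files and correct any errors from the first stage.

In Windows, the way to download every file in the list is: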

for /F "eol=;" %f in (filelist.txt) do curl -O %f

or, in a batch file (where the percent signs are doubled):

for /F "eol=;" %%f in (filelist.txt) do curl -O %%f
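
If you would rather not depend on CMD, the same list-driven download can be sketched in Python. This is only an approximation of the `for /F "eol=;"` loop under stated assumptions: `read_url_list` and `download_all` are hypothetical names, and the sketch uses the standard library's `urllib.request.urlretrieve` in place of `curl -O`, without the custom headers a site may require.

```python
import urllib.request
from pathlib import Path


def read_url_list(text: str) -> list[str]:
    """Return the URLs in a list file, one per line (hypothetical helper)."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines starting with ';', like for /F "eol=;".
        if not line or line.startswith(";"):
            continue
        # for /F keeps only the first whitespace-delimited token by default.
        urls.append(line.split()[0])
    return urls


def download_all(list_file: str = "filelist.txt") -> None:
    """Save every listed URL under its last path segment, like curl -O."""
    for url in read_url_list(Path(list_file).read_text()):
        filename = url.rstrip("/").split("/")[-1]
        urllib.request.urlretrieve(url, filename)
        print("saved", filename)
```

Calling `download_all()` in the folder that holds filelist.txt would then fetch each entry in turn. Prefixing a line with `;` is a quick way to exclude a file without deleting it from the list.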

Answered 2025-04-12 by Python大师

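As I understand it, you want to scrape data from a web page, so this does not work the way a file manager does. You need a Python library such as BeautifulSoup or lxml. The code below uses lxml and should do what you need. It saves the PDF files in the folder the code is run from: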

import requests
from lxml import html

# Send a browser-like User-Agent with every request.
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0'
}
issue_url = "https://journals.ametsoc.org/view/journals/mwre/131/5/mwre.131.issue-5.xml"
url_base = "https://journals.ametsoc.org/downloadpdf"

# Fetch the issue page and collect the link of every article on it.
response = requests.get(issue_url, headers=headers, timeout=30)
page = html.fromstring(response.text)
url_list = page.xpath("//h1/a[@class='c-Button--link']/@href")

for url in url_list:
    # Each article link ends in .xml; the PDF sits at the same path under /downloadpdf.
    url_half = url.replace('.xml', '.pdf')
    url_pdf = url_base + url_half
    filename = url_half.split('/')[-1]
    response = requests.get(url_pdf, headers=headers, timeout=30)
    # Check only the media type; the header may carry extra parameters.
    if response.headers.get('content-type', '').startswith('application/pdf'):
        # Write the content to a PDF file
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filename}")
    else:
        print(f"{url_pdf} did not return a PDF file.")
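
The wget command in the question used `--wait 10 --random-wait` to space requests out, whereas the loop above sends them back to back. Here is a minimal sketch of a comparable pause, assuming wget's documented behaviour of varying the wait between 0.5 and 1.5 times the base value. The names `random_wait` and `polite_pause` are hypothetical.

```python
import random
import time


def random_wait(base_seconds: float) -> float:
    """Pick a delay between 0.5x and 1.5x the base, similar to wget --random-wait."""
    return random.uniform(0.5 * base_seconds, 1.5 * base_seconds)


def polite_pause(base_seconds: float = 10.0) -> None:
    """Sleep for a randomized interval; intended to be called once per download."""
    time.sleep(random_wait(base_seconds))
```

Calling `polite_pause()` at the end of each loop iteration would keep the request rate close to what the original wget command asked for.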

Answered 2025-04-12 by Python大师
