How to get a website's HTML file

3 answers

User

#1 · edited 2024-04-20 09:05:48

You call .read() twice on each file.

>>> f.read()
'This is the entire file.\n'
>>> f.read()
''

"If the end of the file has been reached, f.read() will return an empty string ("")." (7.2.1 Docs)

So when you compare the two results, they are equal, because each one is an empty string.
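You can reproduce the effect without touching the network. This is only a minimal sketch: io.StringIO stands in for the two open files, and the names f and g and their contents are made up for the demo.

```python
import io

# io.StringIO behaves like an open text file for the purposes of read().
f = io.StringIO("This is the entire file.\n")

first = f.read()    # consumes everything up to the end of the stream
second = f.read()   # the position is already at the end, so nothing is left

print(repr(first))
print(repr(second))

# A second stream with completely different contents.
g = io.StringIO("Completely different text.\n")
g.read()            # exhaust it, as the first print() in the question does

# Both streams are exhausted, so both reads return '' and compare equal.
both_empty_equal = (f.read() == g.read())
print(both_empty_equal)
```

The last comparison succeeds even though the two streams never held the same text, which is the bug in the question.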

User

#2 · edited 2024-04-20 09:05:48

When you first print the online and offline pages, you use these lines:

print(onlinepage.read())
print(offlinepage.read())

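...you have now consumed all the text in each file object. Any later read on either object returns an empty string. Two empty strings are equal, so the if condition always evaluates to True.

If you were dealing purely with files, you could seek to the beginning of both files and read them again. The file object returned by urlopen has no seek method, so you either need to re-fetch the page with a new urlopen call or, preferably, save the original text in a variable and use that for the later comparison: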

online = onlinepage.read()
print(online)
offline = offlinepage.read()
print(offline)

...

if online == offline:
    ...
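For the purely local case, rewinding with seek(0) looks roughly like this. It is a sketch under assumptions: the temporary file and its contents are invented so the example runs on its own, and it only applies to seekable objects such as regular files.

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("<html>cached copy</html>")
    path = tmp.name

try:
    with open(path, "r") as offlinepage:
        first = offlinepage.read()   # reads the whole file
        empty = offlinepage.read()   # '' because the position is at the end
        offlinepage.seek(0)          # rewind to the start of the file
        again = offlinepage.read()   # the full contents are available again
finally:
    os.remove(path)                  # clean up the temporary file

print(first == again)
print(repr(empty))
```

Storing the text in a variable is still simpler, and it is the only one of the two approaches that also works for the urlopen response.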

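User

#3 · edited 2024-04-20 09:05:48

As others have noted, you can't read a request object twice (and you can't read a file twice without seeking). Once the data has been read, it is no longer available from the object, so you need to store it.

But they missed another problem: you opened the file in w+ mode. w+ allows both reading and writing, but, like w, it truncates the file on open. So whenever you read the local file it is already empty, which means you both destroy the local file and never get a match (unless the online file is empty too).

You need mode r+ or a+ to get a read/write handle that does not truncate the existing file. r+ requires that the file already exists. a+ does not, but it puts the write position at the end of the file, and on some systems every write goes to the end of the file regardless of where you seek.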
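If you want to check the truncation behavior yourself, a small experiment along these lines should do it. The helper name read_after_open and the "hello" contents are hypothetical, chosen only for this sketch.

```python
import os
import tempfile

def read_after_open(mode):
    """Create a file containing 'hello', reopen it with `mode`, return what is readable."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("hello")
        with open(path, mode) as f:
            f.seek(0)        # a+ starts at the end, so rewind before reading
            return f.read()
    finally:
        os.remove(path)

results = {mode: read_after_open(mode) for mode in ("w+", "r+", "a+")}
print(results)  # {'w+': '', 'r+': 'hello', 'a+': 'hello'}
```

Only w+ loses the existing contents. The file is emptied by the open call itself, before any read or write happens.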

So, with both bugs fixed, you'd end up with:

import urllib.request  # assuming Python 3, where urlopen lives in urllib.request

url = "https://www.mywebsite.com/"
# Using with statements properly for safe resource cleanup
with urllib.request.urlopen(url) as onlinepage:
    onlinedata = onlinepage.read()  # bytes, not str
print(onlinedata)

# "r+b" DOES NOT TRUNCATE EXISTING FILE! Binary mode so the data is bytes, like onlinedata
with open("offline.txt", "r+b") as offlinepage:
    offlinedata = offlinepage.read()
    print(offlinedata)

    if onlinedata == offlinedata:
        print("same")  # for debugging
    else:
        print("different")
        # I assume you want to rewrite the local page, or you wouldn't open with +
        # so this is what you'd do to ensure you replace the existing data correctly
        offlinepage.seek(0)     # Ensure you're seeked to beginning of file for write
        offlinepage.write(onlinedata)
        offlinepage.truncate()  # If online data smaller, don't keep offline extra data
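The truncate() call at the end matters whenever the new data is shorter than the old. Here is a rough illustration of the difference. The helper overwrite and the sample strings are invented for this sketch, and it uses a temporary local file instead of a real download.

```python
import os
import tempfile

def overwrite(path, new_data, truncate):
    """Overwrite the file from position 0, optionally cutting off leftover old data."""
    with open(path, "r+") as f:
        f.seek(0)
        f.write(new_data)
        if truncate:
            f.truncate()   # cut the file at the current position
    with open(path) as f:
        return f.read()

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "w") as f:
        f.write("old longer content")
    without_truncate = overwrite(path, "new", truncate=False)

    with open(path, "w") as f:
        f.write("old longer content")
    with_truncate = overwrite(path, "new", truncate=True)
finally:
    os.remove(path)

print(repr(without_truncate))  # 'new longer content'
print(repr(with_truncate))     # 'new'
```

Without truncate(), the tail of the old file survives after the new data, so the local copy would never match the online one on the next comparison.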
