How do I search for a string in a web page without saving it to a file?

0 votes

2 answers

3837 views

Asked 2025-04-17 04:59

I have just started learning Python and have a few questions!

import os
import re
import urllib2  # Python 2 standard library

def extractdownloadurl(url):

    uresponse = urllib2.urlopen(url)  # open the URL
    contents = uresponse.readlines()  # read the page line by line
    fo = open("test.html", "w")  # temporary copy of the page
    for line in contents:
        fo.write(line)  # write each line to the temporary file
    fo.close()

    # keep only the lines that contain both "uploads" and "zip"
    os.system('more test.html | grep uploads | grep zip >> cadena.html')

    f = open("cadena.html", "r")
    text = f.read()
    f.close()

    # first href attribute among the filtered lines
    cadena = ""
    match = re.search(r'href=[\'"]?([^\'" >]+)', text)
    if match:
        cadena = match.group(0)

    texto = cadena[6:]  # drop the leading href=" (6 characters)

    os.system('rm test.html')
    os.system('rm cadena.html')
    return texto

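This is the function I use to download a web page and extract one URL from it based on a few conditions. It works, but I would like something more efficient than saving the page to a file. I want to do something like grep without writing and reading files, which is really slow. I would also like a faster way to get the URL into a string.

Please write me some code that lets me search the content for the URL without saving the content to a file.

I know that is a lot of questions, but I would be very grateful if you could answer them all.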

regular expressions, programming tips, data extraction, web scraping, string search, in-memory processing, grep

2 Answers

Lycha's answer, updated for Python 3:

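import re, urllib.request
# read the whole page into memory; decoding assumes the page is UTF-8
page = urllib.request.urlopen("http://sebsauvage.net/index.html").read().decode('utf-8')
# collect the value of every href attribute
urls = re.findall(r'href=[\'"]?([^\'" >]+)', page)
for url in urls:
    print(url)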

Answered 2025-04-17 by Python大师
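
To get back to the original goal (one download link rather than every link), the `grep uploads | grep zip` step can also be done in memory. The following is only a sketch under a few assumptions: the function names `fetch_page` and `extract_download_url`, the `keywords` parameter, and the sample HTML are all made up for illustration, and falling back to UTF-8 when the server declares no charset is a guess that may not suit every site.

```python
import re
import urllib.request

# Same regular expression as in the question.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')


def fetch_page(url):
    """Download a page and return it as text, without touching the disk."""
    with urllib.request.urlopen(url) as response:
        # Use the charset the server declares; UTF-8 fallback is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_download_url(html, keywords=("uploads", "zip")):
    """Return the first href found on a line containing every keyword, or None."""
    for line in html.splitlines():
        # In-memory stand-in for `grep uploads | grep zip`.
        if all(word in line for word in keywords):
            match = HREF_RE.search(line)
            if match:
                return match.group(1)  # the URL only, without href="
    return None


# Hypothetical sample page, so the example runs without network access.
sample = '''<a href="/about.html">About</a>
<a href="/uploads/readme.txt">Readme</a>
<a href="http://example.com/uploads/tool.zip">Download</a>'''
print(extract_download_url(sample))
```

With a real page the call would look like `extract_download_url(fetch_page(url))`. Using `match.group(1)` returns the captured URL directly, so the `[6:]` slice from the question is no longer needed.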


This script should get you going. It uses the regular expression you wrote and prints every link found on the page:

# Python 2: urllib.urlopen and the print statement do not exist in Python 3
import re, urllib
page = urllib.urlopen("http://sebsauvage.net/index.html").read()
urls = re.findall(r'href=[\'"]?([^\'" >]+)', page)
for url in urls:
    print url

Answered 2025-04-17 by Python大师
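
A regular expression matches `href=` anywhere in the text, including inside comments or scripts. If that becomes a problem, one possible alternative in Python 3 is the standard library's `html.parser` module, which also works entirely in memory. This is a different technique from the one in the answers above, shown only as a rough sketch; the names `LinkCollector` and `find_links` and the sample HTML are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links(html):
    """Return all <a href="..."> values in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample: the second <a> tag has no href and is skipped.
sample = '<p><a href="/uploads/tool.zip">zip</a> <a name="top">no link</a></p>'
print(find_links(sample))
```

The resulting list can then be filtered with ordinary string tests such as `"uploads" in link and link.endswith(".zip")`.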

