Beautiful Soup question

-2 votes

3 answers

722 views

Asked 2025-04-16 15:53

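Someone recently recommended Beautiful Soup for my Python project. I have read some of the documentation on the Beautiful Soup site, but I still cannot work out how to do what I want. I have a page with a lot of links on it: a directory listing that shows each link and its file size. Suppose it looks like this: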


Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream

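I want to enter a search string, say game2, and have it download game2.tar.gz. I was going to use regular expressions, but I have heard that Beautiful Soup is much better. Could someone show me how to do this and explain it?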

Tags: regex, file download, beautiful soup, web parsing, data scraping, link extraction

3 Answers

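There are many videos on YouTube about installing and using Beautiful Soup 4 for "scraping", and they go into a lot of detail. I am still working through them, but the first one was enough to get me installed and started.
You can search YouTube for "Beautiful Soup".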

Answered 2025-04-16 by Python大师


Your question is not very clear.

Given the data you posted, I think all you need is:

content = '''Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream'''


def what_dir(x, content):
    for line in content.splitlines():
        if x in line.split(None,1)[0]:
            return line.split(None,1)[0]
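
As a quick check, here is a minimal sketch of calling that function on the listing above. The query strings are only illustrative. Note that the function returns the first file name containing the search string, and None when nothing matches.

```python
content = '''Parent Directory/       -   Directory
game1.tar.gz    2010-May-24 06:51:39    8.2K    application/octet-stream
game2.tar.gz    2010-Jun-19 09:09:34    542.4K  application/octet-stream
game3.tar.gz    2011-Nov-13 11:53:01    5.5M    application/octet-stream'''


def what_dir(x, content):
    for line in content.splitlines():
        # The first whitespace-separated field of each line is the file name.
        name = line.split(None, 1)[0]
        if x in name:
            return name


print(what_dir('game2', content))    # exact base name
print(what_dir('game', content))     # ambiguous query: first match wins
print(what_dir('missing', content))  # no match: falls through and returns None
```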

Edit

Does this help?

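import re
import urllib.request

sock = urllib.request.urlopen('http://pastie.org/pastes/1801547/reply')
content = sock.read().decode('utf-8', 'replace')  # decode bytes so the str patterns below apply
sock.close()

# (start, end) span of the <textarea> that holds the pasted listing
spa = re.search('<textarea class="pastebox".+?</textarea>',content,re.DOTALL).span()

# the HTML inside the textarea is entity-escaped; \1 requires the link text to equal the href
regx = re.compile('href=&quot;(.+?)&quot;&gt;\\1&lt;')

print(regx.findall(content,*spa))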

Edit 2

Or is this what you want?

import re
import urllib.request

sock = urllib.request.urlopen('http://pastie.org/pastes/1801547/reply')
content = sock.read().decode('utf-8', 'replace')  # decode bytes so the str patterns below apply
sock.close()

spa = re.search('<textarea class="pastebox".+?</textarea>',content,re.DOTALL).span()
regx = re.compile('href=&quot;(.+?)&quot;&gt;\\1&lt;')
# map each base name (the part before the first dot) to its full URL
dic = dict((name.split('.')[0],'http://pastie.org/pastes/1801547/'+name)
           for name in regx.findall(content,*spa))
print(dic)

Result

{'game3': 'http://pastie.org/pastes/1801547/game3.tar.gz',
 'game2': 'http://pastie.org/pastes/1801547/game2.tar.gz',
 'game1': 'http://pastie.org/pastes/1801547/game1.tar.gz'}
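
To get from that dictionary to an actual download, something like the following might work. It is only a sketch: find_url and download are hypothetical helper names that do not appear in the code above, and the pastie.org URLs are simply copied from the result, so they may no longer resolve. urllib.request.urlretrieve is the standard-library call that saves a URL to a local file.

```python
import urllib.request

# Same mapping as the result above (base name -> URL).
dic = {'game3': 'http://pastie.org/pastes/1801547/game3.tar.gz',
       'game2': 'http://pastie.org/pastes/1801547/game2.tar.gz',
       'game1': 'http://pastie.org/pastes/1801547/game1.tar.gz'}


def find_url(dic, query):
    """Return the URL of the first name (in sorted order) containing query, else None."""
    for name in sorted(dic):
        if query in name:
            return dic[name]
    return None


def download(url, filename):
    """Save url to a local file; this performs a network request."""
    urllib.request.urlretrieve(url, filename)
    return filename


url = find_url(dic, 'game2')
print(url)
# download(url, 'game2.tar.gz')  # uncomment to actually fetch the file
```

The download call is left commented out so that the lookup can be tried without touching the network.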

Answered 2025-04-16 by Python大师


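What this code does:

It first imports Beautiful Soup and a module for opening URLs. The searchLinks function then fetches the page, parses it, loops over every <a> tag that has an href attribute, and checks whether the link text contains the query string. Each matching href is yielded, and the last two lines collect the matches into a list and print it.

In short, you pass in a page URL and a search string and get back the links whose text matches.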

from bs4 import BeautifulSoup
import urllib.request

def searchLinks(url, query_string):
    f = urllib.request.urlopen(url)
    soup = BeautifulSoup(f, 'html.parser')
    # href=True skips anchors that have no href attribute
    for a in soup.find_all('a', href=True):
        # get_text() also covers link text nested inside other tags
        if query_string in a.get_text():
            yield a['href']

res = list(searchLinks('http://example.com', 'game2'))
print(res)
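
One thing to watch for: in a directory listing the href values are usually relative (for example game2.tar.gz rather than a full URL), so they probably need to be joined with the page URL before downloading. Here is a minimal sketch using only the standard library. The base URL http://example.com/files/ and the helper name resolve are assumptions for illustration, not part of the code above.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve(base_url, href):
    """Return (absolute_url, local_filename) for an href found on base_url."""
    absolute = urljoin(base_url, href)
    # Use the last path component of the URL as the local file name.
    filename = posixpath.basename(urlparse(absolute).path)
    return absolute, filename


base = 'http://example.com/files/'
for href in ['game2.tar.gz', '/other/game3.tar.gz']:
    print(resolve(base, href))
```

The absolute URL can then be passed to urllib.request.urlretrieve together with the file name to save the file locally.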

Answered 2025-04-16 by Python大师
