Getting links with Scrapy in Python?

1 vote

2 answers

726 views

Asked 2025-04-17 00:19

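Sorry if this question sounds silly, but I really don't know how to use Scrapy. I don't want to create a Scrapy spider (or whatever); I want to fold it into my existing code. I've read the documentation but found it a bit confusing.

What I need is to get the links from a list on a website. I just need one example to understand the process better. Also, is it possible to use a for loop to do something with each list item? They are laid out like this: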

<ul>
  <li>example</li>
</ul>

Thanks!


2 Answers

If the task really is this simple, you probably don't need Scrapy at all.

cat local.html

<html><body>
<ul>  
<li>example</li>  
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>

Then...

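from urllib.request import urlopen  # Python 3 replacement for urllib2
from lxml import html

# Parse the local file shown above
page = urlopen("file:///root/local.html")
root = html.parse(page).getroot()

# Print the text of every <li> (cssselect() needs the cssselect package)
for item in root.cssselect("li"):
    print(item.text_content())

# Print the href of every <a>
for href in root.xpath('//a/@href'):
    print(href)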
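
If installing lxml is not an option, roughly the same result can be had from the standard library alone. This is only a minimal sketch: the class name ListAndLinkParser, the helper extract, and the SAMPLE_HTML string (a copy of local.html above) are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class ListAndLinkParser(HTMLParser):
    """Hypothetical helper: collects <li> text and <a href> values."""

    def __init__(self):
        super().__init__()
        self.items = []      # text content of each <li>
        self.links = []      # href value of each <a>
        self._in_li = False  # True while we are inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Accumulate text only while inside an <li>
        if self._in_li:
            self.items[-1] += data


# Assumed sample input, copied from local.html above
SAMPLE_HTML = """<html><body>
<ul>
<li>example</li>
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>"""


def extract(html_text):
    """Return (list item texts, link hrefs) found in html_text."""
    parser = ListAndLinkParser()
    parser.feed(html_text)
    parser.close()
    return [item.strip() for item in parser.items], parser.links


items, links = extract(SAMPLE_HTML)

# The for loop the question asks about: do something with each list item
for item in items:
    print(item)

for link in links:
    print(link)
```

For messy real-world pages lxml is usually more forgiving, so treat this as a fallback rather than a replacement.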

Answered 2025-04-17 by Python大师


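You could consider BeautifulSoup. It is well suited to parsing HTML and XML files, and its documentation is helpful. Code to get the links looks roughly like this: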

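import httplib2
from bs4 import BeautifulSoup, SoupStrainer  # BeautifulSoup 4

http = httplib2.Http()
# request() returns a (response headers, body) pair
response, content = http.request('http://www.nytimes.com')

# Parse only the <a> tags, then print each link that has an href
soup = BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])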

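SoupStrainer lets you parse only the links instead of the whole document, which is more efficient.

Edit: I just noticed you need to use Scrapy. I haven't used it myself, but take a look at its official documentation; it appears to have the information you need.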

Answered 2025-04-17 by Python大师
