Web crawler without knowledge of the page structure

8 votes

2 answers

4058 views

Asked 2025-04-18 07:52

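I'm trying to teach myself a concept by writing a script. Simply put, I want to write a Python script that, given a few keywords, crawls web pages until it finds the data I need. For example, say I want a list of the venomous snakes that live in the US. I might run my script with the keywords list,venomous,snakes,US and want to be at least 80% sure that what it returns is a list of snakes in the US.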

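I already know how to implement the web spider part. What I want to learn is how to judge a page's relevance without knowing anything about its structure. I've looked into web scraping techniques, but they all seem to assume you know the page's HTML tag structure. Is there an algorithm that would let me pull data out of a page and decide whether it is relevant?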

Any pointers would be greatly appreciated. I'm using Python with urllib and BeautifulSoup.

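Tags: data extraction, web crawler, information retrieval, keyword search, page structure, page relevance, scraping algorithms, venomous snakes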

2 Answers

You're basically asking "how do I write a search engine." That is far from trivial.

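The best approach is to use the search API of Google (or Bing, Yahoo!, etc.) and show the top n results. But if this is a personal project to teach yourself some concepts (I'm not sure exactly which ones), here are a few suggestions: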

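Search the text content of the appropriate tags (<p>, <div>, and so on) for the relevant keywords (this is the obvious part).
Use the keywords to check for tags that are likely to hold what you want. For example, if you're looking for a list of things, a page containing <ul>, <ol>, or even <table> may be a good candidate.
Build a synonym dictionary and search each page for synonyms of your keywords too. Limiting yourself to "US" may give an artificially low rank to a page that only says "America".
Keep a list of words that are not in your keyword set, and rank the pages that contain the most of those words higher. Such pages are (arguably) more likely to contain the answer you're looking for.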
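
The first three suggestions can be sketched roughly as follows. This is a minimal illustration, not a tested ranking method: it uses only the standard library's `html.parser` (BeautifulSoup would work just as well), and the `SYNONYMS` table, the `PageScanner` and `relevance` names, and the 0.8/0.2 weighting are all assumptions made up for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical synonym table; a real one would be far larger.
SYNONYMS = {
    "us": {"us", "usa", "america", "american"},
    "venomous": {"venomous", "poisonous"},
    "snakes": {"snakes", "snake"},
}
LIST_TAGS = {"ul", "ol", "table"}   # tags that suggest "a list of things"
SKIP_TAGS = {"script", "style"}     # text inside these is not page content


class PageScanner(HTMLParser):
    """Collects the visible words of a page and notes list-like tags."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self.has_list = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in LIST_TAGS:
            self.has_list = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.update(re.findall(r"[a-z]+", data.lower()))


def relevance(html, keywords, want_list=True):
    """Score a page in [0, 1]: keyword coverage plus a bonus for list tags."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    hits = 0
    for keyword in keywords:
        variants = SYNONYMS.get(keyword.lower(), {keyword.lower()})
        if variants & scanner.words:
            hits += 1
    coverage = hits / len(keywords)
    bonus = 0.2 if (want_list and scanner.has_list) else 0.0  # arbitrary weight
    return 0.8 * coverage + bonus


page_a = ("<html><body><p>Venomous snakes found in America</p>"
          "<ul><li>Rattlesnake</li></ul></body></html>")
page_b = ("<html><body><p>Snakes make great pets.</p>"
          "<script>var us = 1;</script></body></html>")
keywords = ["venomous", "snakes", "US"]
score_a = relevance(page_a, keywords)
score_b = relevance(page_b, keywords)
print(score_a, score_b)
```

Here `page_a` matches all three keywords ("America" counts for "US" through the synonym table) and contains a `<ul>`, so it scores higher than `page_b`, which only mentions snakes. A threshold on this score is one crude way to approximate the "80% sure" cutoff from the question, but the number is not a calibrated probability.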

Good luck (you'll need it)!

Answered 2025-04-18 by Python大师


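Using a crawling framework such as scrapy (mainly to handle concurrent downloads), you can write a simple spider. Wikipedia is a good place to start. The script below is a complete example that uses scrapy and whoosh, plus an HTML-to-text conversion step. It never stops on its own, and it indexes every page it crawls so that you can search the index later with whoosh. Think of it as a small Google: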

__author__ = "Farsheed Ashouri"

import os
import re
from urllib.parse import urljoin  # Python 3 home of the old urlparse.urljoin

## Spider libraries
import scrapy
from main.items import MainItem  # project item; needs "title" and "links" fields

## Indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT

## HTML-to-text conversion (nltk.clean_html was removed in NLTK 3)
from bs4 import BeautifulSoup

INDEX_DIR = "indexdir"
## Article URLs must not contain ":" after /wiki/, which skips pages like /wiki/Talk:Company
ARTICLE_URL = re.compile(r'https?://en\.wikipedia\.org/wiki/[^:]+$')


def open_writer():
    ## Create the index on first use, otherwise reopen it
    if not os.path.isdir(INDEX_DIR):
        os.mkdir(INDEX_DIR)
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in(INDEX_DIR, schema)
    else:
        ix = open_dir(INDEX_DIR)
    return ix.writer()


class Main(scrapy.Spider):
    name = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Snakes"]

    def parse(self, response):
        ## These XPaths match Wikipedia's markup as of the original answer; adjust them if it has changed
        titles = response.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = response.xpath('//body/div[@id="content"]').extract()
        if not titles:
            return
        title = titles[0]
        content = contents[0] if contents else ""
        links = response.xpath('//a/@href').extract()

        ## Links found on this page; Scrapy's duplicate filter drops URLs it has already requested
        crawled_links = set()
        for link in links:
            url = urljoin(response.url, link)
            if url not in crawled_links and ARTICLE_URL.match(url):
                crawled_links.add(url)
                yield scrapy.Request(url, callback=self.parse)

        print('*' * 80)
        print('crawled: %s | it has %s links.' % (title, len(links)))
        print('*' * 80)

        ## Index only the text of the page, not its markup
        text = BeautifulSoup(content, "html.parser").get_text(" ", strip=True)
        writer = open_writer()  ## opened late so the index lock is held only briefly
        writer.add_document(title=title, content=text)
        writer.commit()

        item = MainItem()
        item["title"] = title
        item["links"] = list(crawled_links)
        yield item

Answered 2025-04-18 by Python大师
