Python Wikipedia auto-downloader

0 votes

3 answers

809 views

Asked 2025-04-16 13:32

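[Using Python 3.1] Does anyone know how to write a Python 3 program that lets the user write a text file containing several comma-separated words? The program should read that file and download the Wikipedia page for each requested item. For example, if they entered hello,python-3,chicken, the program would go to Wikipedia and download http://www.wikipedia.com/wiki/hello, http://www.wikip... Does anyone think this can be done?

By "download" I mean downloading the text content only, not the images.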

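Tags: text processing, programming practice, file reading, network requests, Wikipedia, information extraction, data scraping, automatic download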

3 Answers

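Take a look at the code below. It downloads the content of each page, but not the images. You can, however, get the image links from the parsed XML document.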

import os
from time import sleep
from urllib.parse import quote
from urllib.request import Request, urlopen
from xml.dom import minidom, Node

# Browser-like User-Agent; requests without one may be refused by the server.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"

def child_elements(parent, name):
    # Return the direct child elements of `parent` that have the given tag name.
    return [node for node in parent.childNodes
            if node.nodeType == Node.ELEMENT_NODE and node.nodeName == name]

def element_text(parent, name):
    # Return the text of the first `name` child holding a single text node, or None.
    for node in child_elements(parent, name):
        if len(node.childNodes) == 1:
            return node.childNodes[0].data
    return None

def main():
    # Read the keywords: comma-separated, possibly spread over several lines.
    with open("example.txt", "r") as key_file:
        keywords = [word.strip()
                    for line in key_file
                    for word in line.split(",")
                    if word.strip()]

    print("Total keywords: %d" % len(keywords))

    # Make sure the output directory exists before writing into it.
    if not os.path.isdir("Html"):
        os.makedirs("Html")

    for keyword in keywords:
        # Ask the opensearch API for pages matching the (URL-encoded) keyword.
        url = ("http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search="
               + quote(keyword))
        request = Request(url, headers={"User-Agent": USER_AGENT})
        xmldoc = minidom.parse(urlopen(request))
        root_node = xmldoc.documentElement

        sections = child_elements(root_node, "Section")
        items = child_elements(sections[0], "Item") if sections else []

        if not items:
            print("No results found for " + keyword)
        else:
            print("\nResults found for " + keyword + ":\n")
            for item in items:
                title = element_text(item, "Text")
                if title is not None:
                    print(title)

            # Save the first match as Html/<title>.html.
            title = element_text(items[0], "Text")
            page_url = element_text(items[0], "Url")
            if title is not None and page_url is not None:
                file_name = os.path.join("Html", title.replace("/", "_") + ".html")
                request = Request(page_url, headers={"User-Agent": USER_AGENT})
                # The response body is bytes, so write in binary mode.
                with open(file_name, "wb") as out_file:
                    out_file.write(urlopen(request).read())

        # Pause between keywords so the server is not hit too quickly.
        print("Sleeping")
        sleep(2)

if __name__ == "__main__":
    main()
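
The script above saves the raw HTML of each page. Since the question asks for the text only, here is a minimal sketch of one way to strip the markup with the standard library's html.parser. The names TextExtractor and html_to_text are made up for this example, and the set of tags treated as line breaks is an assumption, not something Wikipedia requires.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style content."""
    SKIP = {"script", "style"}
    # Assumed set of block-level tags after which a line break is inserted.
    BLOCK = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Images carry no text data, so they are dropped automatically.
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

You could call html_to_text on the decoded page before writing it to disk, and save the result with a .txt extension instead of .html. A real Wikipedia page will still include navigation and footer text, so expect to do some extra filtering.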

Answered 2025-04-16 by Python大师

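You have already described quite clearly how to build such a program. So what is the question?

You just need to read the file, split it on commas, and download the URLs. It's that simple!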
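
As a rough sketch of the first two steps (reading and splitting), something like the following might do. The function names parse_keywords and wiki_url are invented for illustration, and the base URL https://en.wikipedia.org/wiki/ is an assumption; the question used a different host.

```python
from urllib.parse import quote

def parse_keywords(text):
    """Split comma-separated text into clean, non-empty keywords."""
    return [word.strip() for word in text.split(",") if word.strip()]

def wiki_url(keyword, base="https://en.wikipedia.org/wiki/"):
    """Build a page URL (base is an assumed default).

    Spaces become underscores; other unsafe characters are percent-encoded.
    """
    return base + quote(keyword.replace(" ", "_"))

if __name__ == "__main__":
    # In the real program the text would come from the user's file, e.g.
    # open("example.txt").read(); a literal string keeps the sketch offline.
    for word in parse_keywords("hello, python-3 ,chicken,"):
        print(wiki_url(word))
```

Stripping whitespace and dropping empty entries makes the input forgiving about stray spaces and trailing commas.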

Answered 2025-04-16 by Python大师

Take a look at urllib.request.
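
For what it's worth, a minimal sketch of fetching one page with urllib.request could look like this. It assumes a reasonably recent Python 3; the User-Agent string and the helper names build_request and fetch_text are placeholders made up for this example.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; replace it with one that identifies your script.
USER_AGENT = "wiki-downloader-example/0.1"

def build_request(url):
    """Wrap a URL in a Request that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_text(url, timeout=10):
    """Download a URL and return the body decoded to str.

    Falls back to UTF-8 when the server does not declare a charset.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling fetch_text("https://en.wikipedia.org/wiki/Chicken") should return the page's HTML as a string, network permitting. Nothing is downloaded until fetch_text is actually called.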

Answered 2025-04-16 by Python大师
