Using split to extract data from a web page

import urllib2
import string

proxy = urllib2.ProxyHandler({"http" : "http://c99.cache.e2bn.org:8084"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

tvCatchup = urllib2.urlopen('http://www.TVcatchup.com')
html = tvCatchup.read()

firstSplit = html.split('<a class="enabled" href="/watch.html?c=')[1:]
for i in firstSplit:
    print i

secondSplit = html.split ('1" title="BBC One"></a></li><li class="v-type" style="color:#6d6d6d;">')[1:]
for i in secondSplit:
    print i

1 Answer

User

#1 · Posted 2024-04-19 03:34:58

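You would usually do this with an HTML parser (see Python's HTMLParser for an example). People also often use a third-party parsing library. Using split is possible, but it is a bit of a hack... I did it anyway. After the initial split of the page into large segments, the next step is to loop through those segments and split each one into smaller pieces, homing in on the information you want.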

big_parts = html.split('href="/watch.html?c=')[1:]
for n, part in enumerate(big_parts):
    small_part = part.split('</a>')[0]
    if n % 2:       # odd numbered segments
        programme = small_part.split('"> ')[1]
        print programme
    else:           # even numbered segments
        smaller_parts = small_part.split('"')
        number = smaller_parts[0]
        channel = smaller_parts[2]
        print number, channel, ':',
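
The loop above needs the live page, so here is a minimal offline sketch of the same splitting steps, written for Python 3 (the thread's code is Python 2). SAMPLE_HTML and extract_listings are hypothetical names, and the sample markup is only a guess at the page structure, pieced together from the strings the question and answer split on.

```python
# Hypothetical sample markup; the real page may differ.
SAMPLE_HTML = (
    '<li><a class="enabled" href="/watch.html?c=1" title="BBC One"></a></li>'
    '<li class="v-type"><a href="/watch.html?c=1"> Breakfast</a></li>'
    '<li><a class="enabled" href="/watch.html?c=2" title="BBC Two"></a></li>'
    '<li class="v-type"><a href="/watch.html?c=2"> Newsnight</a></li>'
)


def extract_listings(html):
    """Return (number, channel, programme) tuples using only str.split."""
    listings = []
    current = None
    # Everything after each 'href="/watch.html?c=' marker is one big segment.
    big_parts = html.split('href="/watch.html?c=')[1:]
    for n, part in enumerate(big_parts):
        # Keep only the text up to the closing </a> of this link.
        small_part = part.split('</a>')[0]
        if n % 2:
            # Odd segments look like: 1"> Breakfast
            programme = small_part.split('"> ')[1]
            listings.append(current + (programme,))
        else:
            # Even segments look like: 1" title="BBC One">
            smaller_parts = small_part.split('"')
            current = (smaller_parts[0], smaller_parts[2])
    return listings


for number, channel, programme in extract_listings(SAMPLE_HTML):
    print(number, channel, ':', programme)
```

Returning tuples instead of printing inside the loop makes the result easy to check or reuse, but it relies on channel and programme links strictly alternating, exactly as the original loop does.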

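It works because looking for the text between href="/watch.html?c= and </a> happens to identify every segment that contains a channel name or a programme name. Those segments can then be broken down using identifying character sequences ("> and ") to get exactly the information required. If the website changes its HTML, this will probably stop working.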
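
For comparison, here is a rough sketch of the parser approach recommended at the top of this answer, using html.parser.HTMLParser from the Python 3 standard library. ListingParser and SAMPLE_HTML are hypothetical names, and the sample markup is an assumption based on the strings used in this thread, not a copy of the real page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real page may differ.
SAMPLE_HTML = (
    '<li><a class="enabled" href="/watch.html?c=1" title="BBC One"></a></li>'
    '<li class="v-type"><a href="/watch.html?c=1"> Breakfast</a></li>'
    '<li><a class="enabled" href="/watch.html?c=2" title="BBC Two"></a></li>'
    '<li class="v-type"><a href="/watch.html?c=2"> Newsnight</a></li>'
)


class ListingParser(HTMLParser):
    """Collect (number, channel, programme) tuples from /watch.html links."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._channel = None              # (number, channel) seen most recently
        self._in_programme_link = False   # True while inside a programme <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href') or ''
        if not href.startswith('/watch.html?c='):
            return
        number = href.split('c=', 1)[1]
        if 'title' in attrs:
            # Assumed: channel links carry the channel name in title=.
            self._channel = (number, attrs['title'])
        else:
            # Assumed: programme links carry the name as link text.
            self._in_programme_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_programme_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_programme_link:
            self._in_programme_link = False
            programme = ''.join(self._text).strip()
            if self._channel is not None:
                self.listings.append(self._channel + (programme,))


parser = ListingParser()
parser.feed(SAMPLE_HTML)
parser.close()
for number, channel, programme in parser.listings:
    print(number, channel, ':', programme)
```

Because the parser reads attributes by name, it should tolerate changes in attribute order or whitespace that would break the split version, though it still depends on the assumed link structure.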
