Python 2.7: list of links

    def link_parser(soup, itemsList):
        for item in soup.findAll("div", {"class": "tileInfo"}):
            for link in item.findAll("a", {"class": "productClick productTitle"}):
                try:
                    itemsList.put(removeNonAscii(html_parser.unescape(link.string)).replace(',', ' ') + "," + clean_a_url(link['href']))
                except Exception:
                    print "Formatting error: "
                    traceback.print_exc(file=sys.stdout)
        return ""

1 answer

Answer #1 · posted 2024-04-26 10:20:00

It looks like you're trying to scrape Target's website, perhaps this page.

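You've run into a fundamental difficulty with web scraping: what you see is not always what you get. In this case, the page rearranges a lot of its content after it loads. Notice the little pinwheel animation when the page first loads: the content you're trying to access simply doesn't exist in the DOM until the various JS scripts on the page have run. (They have a whole bunch of them.)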

I clicked around a bit, and it looks like the code responsible for generating the content is this jQuery template:

   <script id="productTitleTmpl" type="text/x-jquery-tmpl" >
        {{if $item.parent.parent.viewType != "details"}}
            {{tmpl($data.itemAttributes) "#productBrandTmpl"}}
        {{/if}}
        <a class="productClick productTitle" id="prodTitle-{{= $item.parent.parent.viewType}}-{{= $item.parent.parent.currentPageNumber}}-{{= $item.parent.productCounter}}" href="/{{= productDetailPageURL}}#prodSlot={{= $item.parent.parent.viewType}}_{{= $item.parent.parent.currentPageNumber}}_{{= $item.parent.productCounter}}" title="{{= title}}" name="prodTitle_{{= $item.catalogEntryId}}">
            {{= $item.parent.parent.fetchProductTitleForView($item.productTitle)}}
        </a>
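
To see why the `findAll` loop in the question comes back empty, here is a minimal sketch in Python 3 (not the question's 2.7) that uses the standard library's `html.parser` in place of BeautifulSoup. Both HTML strings are hypothetical, simplified stand-ins for the real page: one imitates what a plain HTTP fetch returns, the other imitates the DOM after the template has been rendered. Because the parser treats everything inside `<script>` as text, the templated `<a>` never shows up as a tag.

```python
from html.parser import HTMLParser


class ProductLinkFinder(HTMLParser):
    """Collect href values of <a class="productClick productTitle"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == "productClick productTitle":
            self.hrefs.append(attr_map.get("href"))


def find_product_links(html):
    """Return the hrefs of all product title anchors found in the markup."""
    finder = ProductLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hrefs


# Hypothetical sample: what a plain HTTP fetch might return.
# The anchor exists only as template text inside a <script> element.
STATIC_HTML = """
<div class="tileInfo"></div>
<script id="productTitleTmpl" type="text/x-jquery-tmpl">
  <a class="productClick productTitle" href="/{{= productDetailPageURL}}">{{= title}}</a>
</script>
"""

# Hypothetical sample: the same tile after the page's JavaScript has run.
RENDERED_HTML = """
<div class="tileInfo">
  <a class="productClick productTitle" href="/p/example-item">Example item</a>
</div>
"""

print(find_product_links(STATIC_HTML))    # []
print(find_product_links(RENDERED_HTML))  # ['/p/example-item']
```

The first call finds nothing because the only matching anchor is script content, not an element; the second finds the link because the markup already contains the rendered tag.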

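So, anyway. If you really want to scrape this page, you'll need to abandon urllib (or whatever you're using to fetch the HTML). Instead, use a JavaScript-capable headless browser (such as Selenium) to visit the page, let the JavaScript run, and then scrape it. All of that is beyond the scope of this answer, but you can google the various headless browser solutions and find one that suits you.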
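
As a rough starting point, the approach could look like the sketch below. It is written for Python 3 and assumes Selenium 4 with Chrome and a matching driver installed. The CSS selectors are taken from the class names in the question's code, and whether they still match the live page is an assumption. `format_row` and `scrape_product_links` are hypothetical names, not part of the question's code.

```python
def format_row(title, href):
    """Build one 'title,url' line; commas in the title become spaces,
    as in the question's code."""
    return title.strip().replace(",", " ") + "," + href


def scrape_product_links(url, timeout=15):
    """Load the page in headless Chrome, wait for the product links to be
    rendered by the page's JavaScript, and return 'title,url' lines."""
    # Imported here so format_row stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one templated anchor exists in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "a.productClick.productTitle")
            )
        )
        anchors = driver.find_elements(
            By.CSS_SELECTOR, "div.tileInfo a.productClick.productTitle"
        )
        return [format_row(a.text, a.get_attribute("href")) for a in anchors]
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

The explicit wait is the important part: reading the page source immediately after `driver.get` can still race the scripts that fill in the product tiles.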
