使用lxml.etree解析Python Alexa结果

1 投票
1 回答
713 浏览
提问于 2025-04-18 10:52

我在使用AWS的Alexa API,但发现解析结果有点困难,想要得到我想要的信息。

Alexa API返回的是一个对象树,类型是<type 'lxml.etree._ElementTree'>

我用下面的代码来打印这个树:

from lxml import etree
root = tree.getroot()
print etree.tostring(root)

我得到了下面的XML:

<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>

  <aws:ContentData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>Google</aws:Title>
      <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
      <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
    </aws:SiteData>
    <aws:LinksInCount>3453627</aws:LinksInCount>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:Rank>1</aws:Rank>
  </aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>

我尝试用root.find('LinksInCount').text来获取元素的值,但这并没有成功。

我想知道怎么才能获取aws:LinksInCount中的文本3453627

1 个回答

3

你会遇到两个挑战:

  • 使用命名空间的XML
  • 两个命名空间共享相同的命名空间前缀

使用相同前缀的两个不同命名空间的XML文档

你会看到 "aws:" 这个前缀,但它被用于两个不同的命名空间:

xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"

在XML中使用相同的命名空间前缀是完全合法的。规则是,后面的那个才是有效的。

xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
  <aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
    <aws:OperationRequest>
      <aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
    </aws:OperationRequest>
    <aws:UrlInfoResult>
      <aws:Alexa>
        <aws:ContentData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:SiteData>
            <aws:Title>Google</aws:Title>
            <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
            <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
          </aws:SiteData>
          <aws:LinksInCount>3453627</aws:LinksInCount>
        </aws:ContentData>
        <aws:TrafficData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:Rank>1</aws:Rank>
        </aws:TrafficData>
      </aws:Alexa>
    </aws:UrlInfoResult>
    <aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
      <aws:StatusCode>Success</aws:StatusCode>
    </aws:ResponseStatus>
  </aws:Response>
</aws:UrlInfoResponse>
"""

下一个挑战是,如何搜索带命名空间的元素。

我更喜欢使用 xpath,在这个过程中,你可以在xpath表达式中使用任何你喜欢的命名空间,但你必须告诉 xpath 调用这些前缀的意思。这是通过 namespaces 字典来完成的:

from lxml import etree
doc = etree.fromstring(xmlstr.strip())

namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print texts[0]

撰写回答