Extracting data from XML with Python

5 votes
2 answers
7415 views
Asked 2025-04-17 11:04

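I'm trying to walk through an XML document from Google to pull out about six fields. I'm using Google's gdata library to fetch the XML for the user profiles in my Google Apps domain. Here is what I get back: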

<?xml version="1.0"?>
<ns0:feed ns1:etag="W/&quot;LIESANDCRAPfyt7I2A9WhHERE.&quot;" xmlns:ns4="http://www.w3.org/2007/app" xmlns:ns3="http://schemas.google.com/contact/2008" xmlns:ns2="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:ns1="http://schemas.google.com/g/2005" xmlns:ns0="http://www.w3.org/2005/Atom">
    <ns0:updated>2012-01-25T14:52:12.867Z</ns0:updated>
    <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/>
    <ns0:id>domain.com</ns0:id>
    <ns0:generator version="1.0" uri="http://www.google.com/m8/feeds">Contacts</ns0:generator>
    <ns0:author>
        <ns0:name>domain.com</ns0:name>
    </ns0:author>
    <ns0:link type="text/html" rel="alternate" href="http://www.google.com/"/>
    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#feed" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full"/>
    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#batch" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/batch"/>
    <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full?max-results=300"/>
    <ns2:startIndex>1</ns2:startIndex>
    <ns2:itemsPerPage>300</ns2:itemsPerPage>
    <ns0:entry ns1:etag="&quot;CRAPQR4KTit7I2A4&quot;">
        <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/>
        <ns0:id>http://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson</ns0:id>
        <ns1:name>
            <ns1:familyName>Person</ns1:familyName>
            <ns1:fullName>Name Person</ns1:fullName>
            <ns1:givenName>Name</ns1:givenName>
        </ns1:name>
        <ns0:updated>2012-01-25T14:52:13.081Z</ns0:updated>
        <ns1:organization rel="http://schemas.google.com/g/2005#work" primary="true">
            <ns1:orgTitle>JobField</ns1:orgTitle>
            <ns1:orgDepartment>DepartmentField</ns1:orgDepartment>
            <ns1:orgName>CompanyField</ns1:orgName>
        </ns1:organization>
        <ns3:status indexed="true"/>
        <ns0:title>Name Person</ns0:title>
        <ns0:link type="image/*" rel="http://schemas.google.com/contacts/2008/rel#photo" href="https://www.google.com/m8/feeds/photos/profile/domain.com/nperson"/>
        <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/>
        <ns0:link type="application/atom+xml" rel="edit" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/>
        <ns1:email rel="http://schemas.google.com/g/2005#other" address="nperson@gapps.domain.com"/>
        <ns1:email rel="http://schemas.google.com/g/2005#other" primary="true" address="nperson@domain.com"/>
        <ns4:edited>2012-01-25T14:52:13.081Z</ns4:edited>
    </ns0:entry>
    <ns0:title>domain.com's Profiles</ns0:title>
</ns0:feed>

I'd like to parse this with lxml, but it isn't going well. Here is my code:

import atom
import gdata.auth
import gdata.contacts
import gdata.contacts.client
from lxml import etree
from lxml import objectify

email = 'admin@domain.com'
password = 'password'
domain = 'domain.com'

gd_client = gdata.contacts.client.ContactsClient(domain=domain)
gd_client.ClientLogin(email, password, 'profileFeedAPI')

profiles_feed = gd_client.GetProfilesFeed('https://www.google.com/m8/feeds/profiles/domain/domain.com/full?max-results=300')

def PrintFeed(feed):
  for i, entry in enumerate(feed.entry):
    print '\n%s %s' % (i+1, entry.title.text)

print(profiles_feed)
PrintFeed(profiles_feed)

profiles_feed2=(str(profiles_feed))

root = objectify.fromstring(profiles_feed2)

print root

print root.tag
print root.text

for e in root.entry():
    print e.tag
    print e.text

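I can get at the feed and the entries, but I can't dig any deeper. All I need is the text of the name fields under ns1:name and the organization fields under ns1:organization. I'm a bit lost, so any help is much appreciated.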

2 Answers

1

You could try XPath expressions with lxml, which should make this much easier.

For example, if your XML file is:

<document>
        <name>
                <familyName>Person</familyName>
                <fullName>Name Person</fullName>
                <givenName>Name</givenName>
        </name>
</document>

then the following session returns the text of every child of name:

>>> import lxml
>>> from lxml import etree
>>> et = etree.parse("test.xml")
>>> value = et.xpath("/document/name/*/text()")
>>> value
['Person', 'Name Person', 'Name']

To work out XPath expressions, you can use the Firebug add-on for Firefox.
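Note that the sample document above has no namespaces, while the feed in the question does, so the paths need a prefix-to-URI map. Here is a minimal sketch of that, assuming the feed is available as a string. It uses the standard library's xml.etree.ElementTree so it runs without extra installs; lxml.etree accepts the same namespaces argument on find, findall and findtext. The prefixes atom and gd, the SAMPLE_FEED string and the extract_profiles helper are names made up for this illustration; only the namespace URIs come from the question.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the feed in the question. The prefixes here
# ("atom", "gd") are arbitrary labels chosen for this sketch; they do not
# have to match the ns0/ns1 prefixes used in the document itself.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gd": "http://schemas.google.com/g/2005",
}

# Hypothetical trimmed-down stand-in for the real feed (same element layout).
SAMPLE_FEED = """<?xml version="1.0"?>
<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom"
          xmlns:ns1="http://schemas.google.com/g/2005">
  <ns0:entry>
    <ns1:name>
      <ns1:familyName>Person</ns1:familyName>
      <ns1:fullName>Name Person</ns1:fullName>
      <ns1:givenName>Name</ns1:givenName>
    </ns1:name>
    <ns1:organization rel="http://schemas.google.com/g/2005#work" primary="true">
      <ns1:orgTitle>JobField</ns1:orgTitle>
      <ns1:orgDepartment>DepartmentField</ns1:orgDepartment>
      <ns1:orgName>CompanyField</ns1:orgName>
    </ns1:organization>
  </ns0:entry>
</ns0:feed>
"""

# Output key -> path relative to each <entry> element.
FIELDS = {
    "familyName": "gd:name/gd:familyName",
    "fullName": "gd:name/gd:fullName",
    "givenName": "gd:name/gd:givenName",
    "orgTitle": "gd:organization/gd:orgTitle",
    "orgDepartment": "gd:organization/gd:orgDepartment",
    "orgName": "gd:organization/gd:orgName",
}


def extract_profiles(feed_xml):
    """Return one dict per <entry>; missing fields become empty strings."""
    root = ET.fromstring(feed_xml)
    return [
        {key: entry.findtext(path, default="", namespaces=NS)
         for key, path in FIELDS.items()}
        for entry in root.findall("atom:entry", NS)
    ]


if __name__ == "__main__":
    for profile in extract_profiles(SAMPLE_FEED):
        print(profile)
```

With the real feed you would pass str(profiles_feed) from the question's code instead of SAMPLE_FEED. Matching is done on the namespace URI, so it keeps working if the serializer picks different prefixes.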

2

I always recommend BeautifulSoup because its interface is very simple and easy to pick up:

from BeautifulSoup import BeautifulStoneSoup as Soup

soup = Soup(open(filename))
for tag in soup.findAll('ns1:name'):
    print tag.find('ns1:familyname').text
    print tag.find('ns1:fullname').text
    print tag.find('ns1:givenname').text
for tag in soup.findAll('ns1:organization'):
    print tag.find('ns1:orgtitle').text
    print tag.find('ns1:orgdepartment').text
    print tag.find('ns1:orgname').text

Sample output:

Person
Name Person
Name
JobField
DepartmentField
CompanyField
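One caveat with matching on the literal string ns1:name: the ns1 prefix is just whatever the serializer happened to emit, and it can change between responses. As a rough alternative sketch (not part of the answer above), the same loop can match on the namespace URI instead, using the standard library's xml.etree.ElementTree and Clark notation ({uri}localname). The SAMPLE string and the child_texts helper are hypothetical names for this illustration.

```python
import xml.etree.ElementTree as ET

# Clark notation prefix: "{namespace-uri}" followed by the local tag name.
# The URI is the one bound to ns1 in the question's feed.
GD = "{http://schemas.google.com/g/2005}"

# Hypothetical cut-down document; note it uses a different prefix ("gd")
# than the question's feed, which does not matter when matching by URI.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <gd:name>
      <gd:familyName>Person</gd:familyName>
      <gd:fullName>Name Person</gd:fullName>
      <gd:givenName>Name</gd:givenName>
    </gd:name>
    <gd:organization>
      <gd:orgTitle>JobField</gd:orgTitle>
      <gd:orgDepartment>DepartmentField</gd:orgDepartment>
      <gd:orgName>CompanyField</gd:orgName>
    </gd:organization>
  </entry>
</feed>
"""


def child_texts(root, local_name):
    """Collect the text of every child of each matching element, in document order."""
    return [child.text
            for element in root.iter(GD + local_name)
            for child in element]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    # First all name fields, then all organization fields, as in the loops above.
    for text in child_texts(root, "name") + child_texts(root, "organization"):
        print(text)
```

Run against SAMPLE, this prints the six values in the same order as the sample output above.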
