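Filtering an XML file with Python minidom
I am trying to filter an XML file using Python's minidom. I want to return a list of email addresses (<wd:Email_Address>), on the condition that the address is a work email. I need to use the element <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID> to filter the addresses. Here is the file: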
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schema.xmlsoap.org/soap/envelope/">
<env:Body>
<wd:Get_Working_Response xmlns:wd="urn:com.workway/bsvc"
wd:version="v40.1">
<wd:Request_Criteria>
<wd:Transaction_Log_Criteria_Data>
</wd:Transaction_Log_Criteria_Data>
<wd:Field_And_Parameter_Criteria_Data>
</wd:Field_And_Parameter_Criteria_Data>
<wd:Eligibility_Criteria_Data>
</wd:Eligibility_Criteria_Data>
</wd:Request_Criteria>
<wd:Response_Filter>
</wd:Response_Filter>
<wd:Response_Group>
</wd:Response_Group>
<wd:Response_Results>
</wd:Response_Results>
<wd:Response_Data>
<wd:Worker>
<wd:Worker_Reference>
<wd:ID wd:type="WID">787878787878787</wd:ID>
<wd:ID wd:type="Employee_ID">123456</wd:ID>
</wd:Worker_Reference>
<wd:Worker_Descriptor>John Smith</wd:Worker_Descriptor>
<wd:Worker_Data>
<wd:Worker_ID>123456</wd:Worker_ID>
<wd:User_ID>jsmith</wd:User_ID>
<wd:Personal_Data>
<wd:Email_Address_Data>
<wd:Email_Address>jsmith2222@gmail.com</wd:Email_Address>
<wd:Usage_Data wd:Public="0">
<wd:Type_Data wd:Primary="1">
<wd:Type_Reference>
<wd:ID wd:type="WID">000000000000000</wd:ID>
<wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
</wd:Type_Reference>
</wd:Type_Data>
</wd:Usage_Data>
<wd:Email_Reference>
<wd:ID wd:type="WID">99999999999999999999999</wd:ID>
<wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-3960</wd:ID>
</wd:Email_Reference>
<wd:ID>EMAIL_REFERENCE-3-3960</wd:ID>
</wd:Email_Address_Data>
<wd:Email_Address_Data>
<wd:Email_Address>jsmith@something.com</wd:Email_Address>
<wd:Usage_Data wd:Public="1">
<wd:Type_Data wd:Primary="1">
<wd:Type_Reference>
<wd:ID wd:type="WID">999999999999999999999999999</wd:ID>
<wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
</wd:Type_Reference>
</wd:Type_Data>
</wd:Usage_Data>
<wd:Email_Reference>
<wd:ID wd:type="WID">999999999999999999999999</wd:ID>
<wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-4017</wd:ID>
</wd:Email_Reference>
<wd:ID>EMAIL_REFERENCE-3-4017</wd:ID>
</wd:Email_Address_Data>
</wd:Personal_Data>
</wd:Worker_Data>
</wd:Worker>
</wd:Response_Data>
</wd:Get_Working_Response>
</env:Body>
</env:Envelope>
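So far I have been able to build a list (workelements) containing the DOM elements that mark a work email. I think I need to use that list to filter the file further and put the result into a new list (lNodesWithLevel2) that contains only the Email_Address_Data elements for work emails. Once I have those elements, I can read the Email_Address values. Any help would be much appreciated. If another library makes this easier, I am happy to try it. Here is my code so far: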
from xml.dom import minidom

xmlDoc = minidom.parse('XML_Example.xml')
workelements = []
lNodesWithLevel1 = xmlDoc.getElementsByTagName('wd:ID')
for mynodes in lNodesWithLevel1:
    if mynodes.firstChild.nodeValue == 'WORK':
        workelements.append(mynodes)
lNodesWithLevel2 = [lNode for lNode in xmlDoc.getElementsByTagName('wd:Email_Address_Data')
                    if lNode.getElementsByTagName('wd:ID') == li]
4 Answers
2
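I want to return a list of email addresses (<wd:Email_Address>), on the condition that the address is a work email. I need to use the element <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID> to filter the addresses.
Using ElementTree from the Python standard library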
import xml.etree.ElementTree as ET
xml_data = '''<env:Envelope xmlns:env="http://schema.xmlsoap.org/soap/envelope/">
<env:Body>
<wd:Get_Working_Response xmlns:wd="urn:com.workway/bsvc"
wd:version="v40.1">
<wd:Request_Criteria>
<wd:Transaction_Log_Criteria_Data>
</wd:Transaction_Log_Criteria_Data>
<wd:Field_And_Parameter_Criteria_Data>
</wd:Field_And_Parameter_Criteria_Data>
<wd:Eligibility_Criteria_Data>
</wd:Eligibility_Criteria_Data>
</wd:Request_Criteria>
<wd:Response_Filter>
</wd:Response_Filter>
<wd:Response_Group>
</wd:Response_Group>
<wd:Response_Results>
</wd:Response_Results>
<wd:Response_Data>
<wd:Worker>
<wd:Worker_Reference>
<wd:ID wd:type="WID">787878787878787</wd:ID>
<wd:ID wd:type="Employee_ID">123456</wd:ID>
</wd:Worker_Reference>
<wd:Worker_Descriptor>John Smith</wd:Worker_Descriptor>
<wd:Worker_Data>
<wd:Worker_ID>123456</wd:Worker_ID>
<wd:User_ID>jsmith</wd:User_ID>
<wd:Personal_Data>
<wd:Email_Address_Data>
<wd:Email_Address>jsmith2222@gmail.com</wd:Email_Address>
<wd:Usage_Data wd:Public="0">
<wd:Type_Data wd:Primary="1">
<wd:Type_Reference>
<wd:ID wd:type="WID">000000000000000</wd:ID>
<wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
</wd:Type_Reference>
</wd:Type_Data>
</wd:Usage_Data>
<wd:Email_Reference>
<wd:ID wd:type="WID">99999999999999999999999</wd:ID>
<wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-3960</wd:ID>
</wd:Email_Reference>
<wd:ID>EMAIL_REFERENCE-3-3960</wd:ID>
</wd:Email_Address_Data>
<wd:Email_Address_Data>
<wd:Email_Address>jsmith@something.com</wd:Email_Address>
<wd:Usage_Data wd:Public="1">
<wd:Type_Data wd:Primary="1">
<wd:Type_Reference>
<wd:ID wd:type="WID">999999999999999999999999999</wd:ID>
<wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
</wd:Type_Reference>
</wd:Type_Data>
</wd:Usage_Data>
<wd:Email_Reference>
<wd:ID wd:type="WID">999999999999999999999999</wd:ID>
<wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-4017</wd:ID>
</wd:Email_Reference>
<wd:ID>EMAIL_REFERENCE-3-4017</wd:ID>
</wd:Email_Address_Data>
</wd:Personal_Data>
</wd:Worker_Data>
</wd:Worker>
</wd:Response_Data>
</wd:Get_Working_Response>
</env:Body>
</env:Envelope>
'''
# Parse the XML data
root = ET.fromstring(xml_data)
# Namespace dictionary
ns = {'wd': 'urn:com.workway/bsvc'}
for email_elem_root in root.findall('.//wd:Email_Address_Data', ns):
    email = email_elem_root.find('./wd:Email_Address', ns).text
    should_collect: bool = email_elem_root.find('.//wd:ID[@wd:type="Communication_Usage_Type_ID"]', ns).text == 'WORK'
    if should_collect:
        print("Collecting Email:", email)
    else:
        print("Ignoring Email:", email)
Output
Ignoring Email: jsmith2222@gmail.com
Collecting Email: jsmith@something.com
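Since the question asks for a list rather than printed lines, the same ElementTree approach can be wrapped in a small function. This is only a minimal sketch: the collect_emails name, the SAMPLE string, and the third entry without any usage data are made up for illustration and are not part of the original file. It also guards against find() returning None, which would otherwise raise an AttributeError on .text.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample document in the question.
NS = {"wd": "urn:com.workway/bsvc"}

# Trimmed-down stand-in for the question's file (hypothetical sample data).
SAMPLE = """<wd:Personal_Data xmlns:wd="urn:com.workway/bsvc">
  <wd:Email_Address_Data>
    <wd:Email_Address>jsmith2222@gmail.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
  <wd:Email_Address_Data>
    <wd:Email_Address>jsmith@something.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
  <wd:Email_Address_Data>
    <wd:Email_Address>no-usage@example.com</wd:Email_Address>
  </wd:Email_Address_Data>
</wd:Personal_Data>"""


def collect_emails(root, usage="WORK"):
    """Return the text of every Email_Address whose usage type matches."""
    emails = []
    for data in root.findall(".//wd:Email_Address_Data", NS):
        address = data.find("wd:Email_Address", NS)
        usage_id = data.find('.//wd:ID[@wd:type="Communication_Usage_Type_ID"]', NS)
        # find() returns None when an element is absent, so guard before .text
        if address is None or usage_id is None:
            continue
        if usage_id.text == usage:
            emails.append(address.text)
    return emails


work_emails = collect_emails(ET.fromstring(SAMPLE))
print(work_emails)
```

For the real file, ET.parse('XML_Example.xml').getroot() could be passed in place of ET.fromstring(SAMPLE).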
2
Here is an example using lxml. If you plan to work with XML files, learning XPath is well worth the effort.
from lxml import etree
tree = etree.parse("input.xml")
xpath = "//wd:Email_Address_Data[wd:Usage_Data//wd:ID[@wd:type='Communication_Usage_Type_ID']='WORK']/wd:Email_Address"
work_emails = [email_elem.text for email_elem in tree.xpath(xpath, namespaces={"wd": "urn:com.workway/bsvc"})]
print(work_emails)
This code outputs:
['jsmith@something.com']
2
With the xml.dom.minidom library, you can do:
import xml.dom.minidom

xmlDoc = xml.dom.minidom.parse('XML_Example.xml')
business = []
for email in xmlDoc.getElementsByTagName("wd:Email_Address_Data"):
    # Reset per entry so a value from a previous entry is never reused
    business_mail = None
    for t in email.getElementsByTagName("wd:ID"):
        if t.getAttribute("wd:type") == "Communication_Usage_Type_ID":
            business_mail = t.firstChild.nodeValue
    for m in email.getElementsByTagName("wd:Email_Address"):
        if business_mail == "WORK":
            business.append(m.firstChild.nodeValue)
print("WORK EMAILs:", business)
Output:
WORK EMAILs: ['jsmith@something.com']
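One thing to keep in mind with the minidom version above: getElementsByTagName("wd:ID") matches the literal prefix, so it would find nothing if the service ever emitted the same namespace under a different prefix. A possible variant, sketched below, uses the namespace-aware getElementsByTagNameNS and getAttributeNS instead. The find_emails name and the SAMPLE string (which deliberately uses the prefix x) are assumptions made up for this illustration.

```python
import xml.dom.minidom

# Namespace URI copied from the sample document in the question.
WD_NS = "urn:com.workway/bsvc"

# Hypothetical sample that binds the same namespace to the prefix "x".
SAMPLE = """<x:Personal_Data xmlns:x="urn:com.workway/bsvc">
  <x:Email_Address_Data>
    <x:Email_Address>jsmith2222@gmail.com</x:Email_Address>
    <x:Type_Reference>
      <x:ID x:type="Communication_Usage_Type_ID">HOME</x:ID>
    </x:Type_Reference>
  </x:Email_Address_Data>
  <x:Email_Address_Data>
    <x:Email_Address>jsmith@something.com</x:Email_Address>
    <x:Type_Reference>
      <x:ID x:type="Communication_Usage_Type_ID">WORK</x:ID>
    </x:Type_Reference>
  </x:Email_Address_Data>
</x:Personal_Data>"""


def find_emails(doc, usage="WORK"):
    """Collect Email_Address text for entries whose usage type matches."""
    result = []
    for data in doc.getElementsByTagNameNS(WD_NS, "Email_Address_Data"):
        # Usage values declared inside this Email_Address_Data entry
        usages = [
            id_node.firstChild.nodeValue
            for id_node in data.getElementsByTagNameNS(WD_NS, "ID")
            if id_node.getAttributeNS(WD_NS, "type") == "Communication_Usage_Type_ID"
            and id_node.firstChild is not None
        ]
        if usage in usages:
            for addr in data.getElementsByTagNameNS(WD_NS, "Email_Address"):
                if addr.firstChild is not None:
                    result.append(addr.firstChild.nodeValue)
    return result


doc = xml.dom.minidom.parseString(SAMPLE)
emails = find_emails(doc)
print("WORK EMAILs:", emails)
```

Matching on the namespace URI and local name is a bit more verbose, but it no longer depends on which prefix the document happens to use.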