Accessing a line inside a tag using xmltodict in Python

Posted 2024-05-23 20:23:45


I have an XML file that looks like:

<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
      please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->

<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">

    <!-- The crowd-classifier element will create a tool for the Worker to
 select the correct answer to your question.
          Your image file URLs will be substituted for the "image_url" variable below

          when you publish a batch with a CSV input file containing multiple image file URLs.

          To preview the element with an example image, try setting the src attribute to

          "https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n        
src= "https://someone@example.com/abcd.jpg"\n        
categories="[\'Yes\', \'No\']"\n        
header="abcd"\n        
name="image-contains">\n\n       
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n              
good and bad answers here can help get good results. You can include\n              
any HTML here. -->\n        
<short-instructions>\n\n        
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->

I want to extract the line:

src = https://someone@example.com/abcd.jpg

and assign it to a variable in Python. I am new to XML parsing.

I tried:

hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']

Error:

    image_url = hit_doc['HTMLQuestion']['HTMLContent']['crowd-form']['crowd-image-classifier']
TypeError: string indices must be integers

If I don't access ['crowd-image-classifier'] in the code and limit myself to

hit_doc = xmltodict.parse(get_hit['HIT']['Question'])
image_url = hit_doc['HTMLQuestion']['HTMLContent']

then I get the complete XML file.
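
A note on the error: getting the whole document back from ['HTMLContent'] suggests that value is a plain Python string rather than a nested dict, presumably because the markup is wrapped in CDATA. Indexing a string with another string is what raises this TypeError. A minimal sketch of that behaviour, where html_content is a hypothetical stand-in for hit_doc['HTMLQuestion']['HTMLContent']:

```python
# Hypothetical stand-in for hit_doc['HTMLQuestion']['HTMLContent'] (assumed to be a str).
html_content = (
    '<crowd-form>'
    '<crowd-image-classifier src="https://someone@example.com/abcd.jpg">'
    '</crowd-image-classifier>'
    '</crowd-form>'
)

error_message = None
try:
    # A str only accepts integer or slice indices, so a string key fails.
    html_content['crowd-form']
except TypeError as exc:
    error_message = str(exc)

print(type(html_content).__name__)
print(error_message)
```

If that is what is happening, the string has to be parsed a second time before the src attribute can be read.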

How do I access the img src?


2 Answers

You can use BeautifulSoup. See the working code below.

from bs4 import BeautifulSoup


html = '''<!-- For the full list of available Crowd HTML Elements and their input/output documentation,
      please refer to https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html -->

<!-- You must include crowd-form so that your task submits answers to MTurk -->
<crowd-form answer-format="flatten-objects">

    <!-- The crowd-classifier element will create a tool for the Worker to
 select the correct answer to your question.
          Your image file URLs will be substituted for the "image_url" variable below

          when you publish a batch with a CSV input file containing multiple image file URLs.

          To preview the element with an example image, try setting the src attribute to

          "https://s3.amazonaws.com/cv-demo-images/two-birds.jpg" -->
<crowd-image-classifier\n        
src= "https://someone@example.com/abcd.jpg"\n        
categories="[\'Yes\', \'No\']"\n        
header="abcd"\n        
name="image-contains">\n\n       
<!-- Use the short-instructions section for quick instructions that the Worker\n
will see while working on the task. Including some basic examples of\n
good and bad answers here can help get good results. You can include\n
any HTML here. -->\n
<short-instructions>\n\n        
</crowd-image-classifier>
</crowd-form>
<!-- YOUR HTML ENDS -->'''

soup = BeautifulSoup(html, 'html.parser')
element = soup.find('crowd-image-classifier')
print(element['src'])

Output

https://someone@example.com/abcd.jpg
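
If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is only an illustration: the SrcFinder class and the shortened sample string are hypothetical and not part of the answer above.

```python
from html.parser import HTMLParser


class SrcFinder(HTMLParser):
    """Records the src attribute of the first <crowd-image-classifier> tag."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == 'crowd-image-classifier' and self.src is None:
            self.src = dict(attrs).get('src')


# Shortened, hypothetical copy of the markup from the question.
sample = '''<crowd-form answer-format="flatten-objects">
<crowd-image-classifier
src= "https://someone@example.com/abcd.jpg"
categories="['Yes', 'No']"
header="abcd"
name="image-contains">
<short-instructions>
</crowd-image-classifier>
</crowd-form>'''

finder = SrcFinder()
finder.feed(sample)
image_url = finder.src
print(image_url)
```

Like BeautifulSoup with 'html.parser', this tolerates the unclosed <short-instructions> tag, which a strict XML parser would reject.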

I switched to using xml ElementTree instead.

The syntax I ended up with is roughly:

import xml.etree.ElementTree as ET

# hit_doc must be the XML text itself (a str), not the dict returned by xmltodict
root = ET.fromstring(hit_doc)
for child in root:
    if child[0].text == 'crowd-image-classifier':
        image_data = child[1].text
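
For comparison, here is a self-contained sketch of a two-step ElementTree approach. It rests on assumptions not confirmed in the thread: that the Question value is an HTMLQuestion envelope holding the markup in a CDATA section, that the envelope carries no XML namespace (a real response may, in which case the HTMLContent lookup needs adjusting), and that the inner markup is well-formed XML. The question_xml string is a hypothetical stand-in for get_hit['HIT']['Question'].

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for get_hit['HIT']['Question'] (no namespace, well-formed inner markup).
question_xml = '''<HTMLQuestion>
  <HTMLContent><![CDATA[
<crowd-form answer-format="flatten-objects">
  <crowd-image-classifier
      src="https://someone@example.com/abcd.jpg"
      categories="['Yes', 'No']"
      header="abcd"
      name="image-contains">
    <short-instructions></short-instructions>
  </crowd-image-classifier>
</crowd-form>
  ]]></HTMLContent>
</HTMLQuestion>'''

# Step 1: parse the envelope. The CDATA section comes back as one plain string.
outer = ET.fromstring(question_xml)
inner_markup = outer.find('HTMLContent').text

# Step 2: parse that string again. This raises ParseError if the markup is not well-formed XML.
inner = ET.fromstring(inner_markup.strip())
classifier = next(inner.iter('crowd-image-classifier'))

# Attributes are read with .get(), not .text.
image_url = classifier.get('src')
print(image_url)
```

The markup shown in the question leaves <short-instructions> unclosed, so step 2 would fail on it as written. A lenient HTML parser is the safer choice in that case.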
