How to parse XML elements in Google Refine using Jython/Python

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>J. Koenig</dc:creator>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:date>2010-01-13T15:47:38Z</dc:date>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>CCTL0059</dc:identifier>
  <dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>

3 Answers

User

Answer 1 · edited 2024-05-23 22:42:26

Here is a slightly adjusted version of J.F. Sebastian's answer that can be pasted directly into Google Refine:

from xml.etree import ElementTree as ET
element = ET.fromstring(value)
namespace = "{http://purl.org/dc/elements/1.1/}"
return ','.join([e.text for e in element.getiterator(namespace+'identifier')])

It returns a comma-separated list, but you can change the separator used in the return statement.
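To try the separator change outside Google Refine, the same logic can be wrapped in a function. This is only a minimal sketch: `join_identifiers`, `DC_NAMESPACE`, and `SAMPLE_XML` are names made up for illustration, the sample is a shortened copy of the XML from the question, and `findall` is used in place of `getiterator` so the snippet also runs on current CPython. Inside Refine you would still use `value` and a top-level `return` as shown above.

```python
from xml.etree import ElementTree as ET

# Dublin Core namespace in the "{uri}" form that ElementTree expects.
DC_NAMESPACE = "{http://purl.org/dc/elements/1.1/}"

# Shortened copy of the record from the question (assumed test input).
SAMPLE_XML = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator>J. Koenig</dc:creator>'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>'
    '</oai_dc:dc>'
)


def join_identifiers(xml_text, separator=","):
    """Return the text of every dc:identifier element, joined by `separator`."""
    root = ET.fromstring(xml_text)
    # ".//" searches all descendants of the root element.
    matches = root.findall(".//" + DC_NAMESPACE + "identifier")
    # Treat an empty element (text is None) as an empty string.
    return separator.join([e.text or "" for e in matches])


if __name__ == "__main__":
    print(join_identifiers(SAMPLE_XML))          # default comma separator
    print(join_identifiers(SAMPLE_XML, " | "))   # custom separator
```

Passing a different `separator` is the standalone equivalent of editing the string in front of `.join` in the Refine expression.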

User

Answer 2 · edited 2024-05-23 22:42:26

You are using the wrong namespace. This works in Jython 2.5.1:

from xml.etree import ElementTree as ET
element = ET.fromstring(value) # `value` is a string with the xml from question

namespace = "{http://purl.org/dc/elements/1.1/}"
for e in element.getiterator(namespace+'identifier'):
    print e.text

Output

CCTL0059
CCTL0059
http://open.jorum.ac.uk:80/xmlui/handle/123456789/335
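Why the namespace matters can be checked with a small experiment. ElementTree stores each tag as `{namespace-uri}localname`, so a search only matches when the URI is the one bound to the `dc` prefix. The sketch below is an illustration under assumptions: the code from the question is not shown here, so the `oai_dc` URI simply stands in for "a wrong namespace", and `count_matches` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Reduced version of the record from the question (assumed test input).
XML = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '</oai_dc:dc>'
)

root = ET.fromstring(XML)

# Tags are stored with the full namespace URI, not the prefix.
root_tag = root.tag
child_tags = [child.tag for child in root]


def count_matches(namespace_uri):
    """Count identifier elements that live in the given namespace."""
    return len(root.findall(".//{%s}identifier" % namespace_uri))


# The dc namespace matches; the oai_dc namespace of the wrapper does not.
dc_count = count_matches("http://purl.org/dc/elements/1.1/")
oai_count = count_matches("http://www.openarchives.org/OAI/2.0/oai_dc/")

if __name__ == "__main__":
    print(root_tag)
    print(child_tags)
    print(dc_count, oai_count)
```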

User

Answer 3 · edited 2024-05-23 22:42:26

You can use a GREL expression like this one; give it a try:

forEach(value.parseHtml().select("dc|identifier"),v,v.htmlText()).join(",")

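For each identifier found, it takes the htmlText and joins the results with commas. parseHtml() uses the jsoup library (jsoup.org) and really parses the tags and structure. It also understands namespaces written in the ns|identifier format, which makes it a good way to get what you are after in this case.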
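For comparison, ElementTree has a roughly similar prefix-based lookup: `findall` accepts a prefix-to-URI mapping, so the query can be written as `dc:identifier` rather than with the full `{uri}` form. A sketch under assumptions: the mapping argument exists in CPython 2.7 and later, and it has not been checked against the Jython bundled with Google Refine, so treat this as something to run outside Refine. `NAMESPACES`, `SAMPLE`, and `identifiers_by_prefix` are hypothetical names.

```python
from xml.etree import ElementTree as ET

# Maps the prefix used in the query to its namespace URI.
NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Reduced version of the record from the question (assumed test input).
SAMPLE = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:identifier>CCTL0059</dc:identifier>'
    '<dc:identifier>http://open.jorum.ac.uk:80/xmlui/handle/123456789/335</dc:identifier>'
    '<dc:format>application/pdf</dc:format>'
    '</oai_dc:dc>'
)


def identifiers_by_prefix(xml_text):
    """Return the text of each dc:identifier, queried by prefix."""
    root = ET.fromstring(xml_text)
    # The prefix in the path is resolved through NAMESPACES, not through
    # the xmlns declarations in the document itself.
    return [e.text for e in root.findall(".//dc:identifier", NAMESPACES)]


if __name__ == "__main__":
    print(",".join(identifiers_by_prefix(SAMPLE)))
```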

