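Reading XML tags from MediaWiki using Java
I need to read the output of the "search" tag from the URL below using Java.
First, I need to read the XML from the following URL into a string: http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother
I should end up with something like this: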
<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="55180"/>
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
</query>
</api>
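Once I have the XML, I need to get the contents of the search tag. The output of the "search" tag looks like this, and I need to extract two parts from the line in the middle: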
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
In the end, I just need two strings, which would be equal to:
String title = Big Brothers Big Sisters of America
String snippet = "<span class='searchmatch'>Big..."
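Can someone help me fix my code? I don't know what I'm doing wrong. I don't think it even retrieves the XML from the URL, let alone the tags inside the XML.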
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother");
doc.getDocumentElement().normalize();
XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression expr = xpath.compile("//query/search/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
System.out.println(nodes.item(i).getNodeValue());
}
Sorry, I'm new to this and couldn't find the answer anywhere.
# Answer 1
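The main problem here is that you are asking for text nodes that are children of <search>, but what you actually want, the <p ...>, is not a text node: it is an element. (In fact, the <search> element has no text node children, as you can tell by using "View Source" on the response from that URL.) So what you want to do is change your XPath expression to
//query/search/p
This gives you the p element node. Then, in your Java code, ask for the values of that node's two attributes, title and snippet. Alternatively, you could run two XPath queries, one per attribute:
//query/search/p/@title
and
//query/search/p/@snippet
assuming there is only one <p> element. If you do this over multiple <p> elements, you will probably want to keep each pair of attributes together rather than ending up with two separate result lists.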
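The element-based approach could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name SearchResultReader, the Hit holder, and the parse and readHits helpers are hypothetical names that are not in the question, and the sample XML is a shortened inline string so the code runs without network access. In the sample the snippet attribute is entity-escaped, because a raw < is not allowed inside an XML attribute value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SearchResultReader {

    /** Hypothetical holder that keeps one result's title and snippet together. */
    public static final class Hit {
        public final String title;
        public final String snippet;

        public Hit(String title, String snippet) {
            this.title = title;
            this.snippet = snippet;
        }
    }

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Selects the <p> elements (not text nodes) and reads their attributes. */
    public static List<Hit> readHits(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//query/search/p", doc, XPathConstants.NODESET);
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            // getAttribute returns the unescaped attribute value
            hits.add(new Hit(p.getAttribute("title"), p.getAttribute("snippet")));
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the API response (assumption, not the real payload)
        String sample = "<api><query><search>"
                + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\""
                + " snippet=\"&lt;span class='searchmatch'&gt;Big&lt;/span&gt; ...\"/>"
                + "</search></query></api>";
        for (Hit hit : readHits(parse(sample))) {
            System.out.println("title = " + hit.title);
            System.out.println("snippet = " + hit.snippet);
        }
    }
}
```

To read from the live API instead, the document could be obtained with builder.parse(url) as in the question's code and then passed to readHits. Returning one Hit per <p> element keeps each title paired with its snippet when the query returns more than one result.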