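How to access the members of an RDF list with rdflib (or plain SPARQL)
What is the best way to access the members of an RDF list? I'm using rdflib (the Python library), but a simple SPARQL answer would also be fine (such an answer can be used through rdfextras, a helper library for rdflib).
I'm trying to access the authors of a particular journal article as exported by Zotero (some fields have been omitted for brevity):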
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:z="http://www.zotero.org/namespaces/export#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:bib="http://purl.org/net/biblio#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
xmlns:link="http://purl.org/rss/1.0/modules/link/">
<bib:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/18273724">
<z:itemType>journalArticle</z:itemType>
<dcterms:isPartOf rdf:resource="urn:issn:0954-6634"/>
<bib:authors>
<rdf:Seq>
<rdf:li>
<foaf:Person>
<foaf:surname>Lee</foaf:surname>
<foaf:givenname>Hyoun Seung</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Lee</foaf:surname>
<foaf:givenname>Jong Hee</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Ahn</foaf:surname>
<foaf:givenname>Gun Young</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Lee</foaf:surname>
<foaf:givenname>Dong Hun</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Shin</foaf:surname>
<foaf:givenname>Jung Won</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Kim</foaf:surname>
<foaf:givenname>Dong Hyun</foaf:givenname>
</foaf:Person>
</rdf:li>
<rdf:li>
<foaf:Person>
<foaf:surname>Chung</foaf:surname>
<foaf:givenname>Jin Ho</foaf:givenname>
</foaf:Person>
</rdf:li>
</rdf:Seq>
</bib:authors>
<dc:title>Fractional photothermolysis for the treatment of acne scars: a report of 27 Korean patients</dc:title>
<dcterms:abstract>OBJECTIVES: Atrophic post-acne scarring remains a therapeutically challe *CUT*, erythema and edema. CONCLUSIONS: The 1550-nm erbium-doped FP is associated with significant patient-reported improvement in the appearance of acne scars, with minimal downtime.</dcterms:abstract>
<bib:pages>45-49</bib:pages>
<dc:date>2008</dc:date>
<z:shortTitle>Fractional photothermolysis for the treatment of acne scars</z:shortTitle>
<dc:identifier>
<dcterms:URI>
<rdf:value>http://www.ncbi.nlm.nih.gov/pubmed/18273724</rdf:value>
</dcterms:URI>
</dc:identifier>
<dcterms:dateSubmitted>2010-12-06 11:36:52</dcterms:dateSubmitted>
<z:libraryCatalog>NCBI PubMed</z:libraryCatalog>
<dc:description>PMID: 18273724</dc:description>
</bib:Article>
<bib:Journal rdf:about="urn:issn:0954-6634">
<dc:title>The Journal of Dermatological Treatment</dc:title>
<prism:volume>19</prism:volume>
<prism:number>1</prism:number>
<dcterms:alternative>J Dermatolog Treat</dcterms:alternative>
<dc:identifier>DOI 10.1080/09546630701691244</dc:identifier>
<dc:identifier>ISSN 0954-6634</dc:identifier>
</bib:Journal>
</rdf:RDF>
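3 Answers
This question is quite old, but for the sake of completeness: accessing an RDF sequence (rdf:Seq) or list (rdf:List) is best solved with SPARQL plus a filter.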
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?container ?member
WHERE {
  ?container ?prop ?member .
  FILTER(?prop = rdfs:member ||
         regex(str(?prop),
               "^http://www.w3.org/1999/02/22-rdf-syntax-ns#_[0-9]+$"))
}
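This is essentially the same as manuel-salvadores' Example 2, but you should constrain his variable ?seq_index (the equivalent of my ?prop) more tightly, so that it only matches the relevant properties.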
Since you also mentioned RDF lists, the SPARQL 1.1 query for that case is:
SELECT ?list ?member
WHERE {
?list rdf:rest*/rdf:first ?member.
}
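In newer versions of RDFLib, accessing collections has become simpler. You can now access the members of a sequence programmatically through the Seq class: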
from rdflib import Graph, Namespace, URIRef
from rdflib.graph import Seq
from rdflib.namespace import FOAF

BIB = Namespace("http://purl.org/net/biblio#")

# Load data
g = Graph()
g.parse("./zotero.rdf", format="application/rdf+xml")

# Get the first resource linked to the article via bib:authors
article = URIRef("http://www.ncbi.nlm.nih.gov/pubmed/18273724")
authors = next(g.objects(article, BIB.authors))

# Seq yields the members in rdf:_1, rdf:_2, ... order
for i, author in enumerate(Seq(g, authors), start=1):
    givenname = g.value(author, FOAF.givenname)
    surname = g.value(author, FOAF.surname)
    print("%i: %s %s" % (i, givenname, surname))
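Working with RDF containers is usually tedious and rather annoying. Here are two solutions, one without SPARQL and one with it. Personally, I prefer the second one, the one that uses SPARQL.
Example 1: without SPARQL
To get all the authors of a given article, you can do something like the code below.
I've added comments to make it easier to follow. The most important piece is the graph method g.triples(triple_pattern), which lets you filter an rdflib graph and search for the triple patterns you need.
When an rdf:Seq is parsed, predicates of the following form are generated: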
http://www.w3.org/1999/02/22-rdf-syntax-ns#_1
http://www.w3.org/1999/02/22-rdf-syntax-ns#_2
http://www.w3.org/1999/02/22-rdf-syntax-ns#_3
rdflib returns these predicates in no particular order, so you need to sort them in order to traverse the members in the correct sequence.
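One way to do that sort is to parse the numeric index out of each membership predicate. This is a minimal sketch using only the standard library; member_index and RDF_NS are names invented for the illustration and are not part of rdflib:

```python
import re

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Matches rdf:_1, rdf:_2, ... and captures the index
_MEMBER_RE = re.compile("^" + re.escape(RDF_NS) + r"_([1-9][0-9]*)$")


def member_index(predicate):
    """Return n for the predicate rdf:_n, or None for any other predicate."""
    match = _MEMBER_RE.match(str(predicate))
    return int(match.group(1)) if match else None


predicates = [RDF_NS + "_10", RDF_NS + "_2", RDF_NS + "type", RDF_NS + "_1"]

# Keep only membership predicates and sort them numerically, not as strings
ordered = sorted(
    (p for p in predicates if member_index(p) is not None),
    key=member_index,
)
print(ordered)
```

Sorting on the integer matters once a sequence has ten or more members, since a plain string sort would put rdf:_10 before rdf:_2.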
import rdflib

RDF = rdflib.namespace.RDF

# Parse the file
g = rdflib.Graph()
g.parse("zot.rdf")

# So that we are sure we get something back
print("Number of triples: %d" % len(g))

# A couple of handy namespaces to use later
BIB = rdflib.Namespace("http://purl.org/net/biblio#")
FOAF = rdflib.Namespace("http://xmlns.com/foaf/0.1/")

# Author counter, printed next to each author
i = 0

# Article for which we want the list of authors
article = rdflib.term.URIRef("http://www.ncbi.nlm.nih.gov/pubmed/18273724")

# The first loop is equivalent to "get all authors for article x"
for triple in g.triples((article, BIB["authors"], None)):

    # This expression drops the rdf:type triple, because we only want the
    # triples whose predicate has the form
    # http://www.w3.org/1999/02/22-rdf-syntax-ns#_SEQ_NUMBER
    # where SEQ_NUMBER is the index of the element in the rdf:Seq
    list_triples = filter(lambda y: RDF['type'] != y[1],
                          g.triples((triple[2], None, None)))

    # Sort the authors by the predicate of the triple: order in sequences does matter ;-)
    # "http://www.w3.org/1999/02/22-rdf-syntax-ns#_435"[44:] returns "435",
    # and since we want numeric order we use int(x[1][44:]) (x[1] is the predicate)
    authors_sorted = sorted(list_triples, key=lambda x: int(x[1][44:]))

    # Iterate over the author bnodes and get surname and givenname
    for author_bnode in authors_sorted:
        for x in g.triples((author_bnode[2], FOAF['surname'], None)):
            author_surname = x[2]
        for y in g.triples((author_bnode[2], FOAF['givenname'], None)):
            author_name = y[2]
        print("author(%s): %s %s" % (i, author_name, author_surname))
        i += 1
This example shows how to do it without SPARQL.
Example 2: with SPARQL
Now exactly the same example, but this time using SPARQL.
# Only needed on old rdflib releases that rely on rdfextras for SPARQL;
# rdflib 4 and later ship a SPARQL engine, so omit these two calls there.
rdflib.plugin.register('sparql', rdflib.query.Processor,
                       'rdfextras.sparql.processor', 'Processor')
rdflib.plugin.register('sparql', rdflib.query.Result,
                       'rdfextras.sparql.query', 'SPARQLQueryResult')

query = """
SELECT ?seq_index ?name ?surname WHERE {
    <http://www.ncbi.nlm.nih.gov/pubmed/18273724> bib:authors ?seq .
    ?seq ?seq_index ?seq_bnode .
    ?seq_bnode foaf:givenname ?name .
    ?seq_bnode foaf:surname ?surname .
}
"""

# Sort the rows by the numeric suffix of the rdf:_n predicate in ?seq_index
for row in sorted(g.query(query, initNs=dict(rdf=RDF, foaf=FOAF, bib=BIB)),
                  key=lambda x: int(x[0][44:])):
    print("Author(%s) %s %s" % (row[0][44:], row[1], row[2]))
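As shown, we still have to do the sorting ourselves, because the library does not handle it automatically. In the query, the variable seq_index holds the predicate that carries the sequence order, and it is what the lambda function sorts on.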