How do I extract XML nodes and key values into a data frame in RStudio, including NA values?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>ktoś</orth>
    <lex disamb="1"><base>ktoś</base><ctag>subst:sg:nom:m1</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">11511</prop>
    <prop key="sense:ukb:syns_rank">11511/128.6156573170 243094/95.1234745165</prop>
    <prop key="sense:ukb:unitsstr">ktoś.2(15:os)</prop>
   </tok>
   <tok>
    <orth>go</orth>
    <lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">47620</prop>
    <prop key="sense:ukb:syns_rank">47620/108.9010709884 234524/90.4766173102</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
   </tok>
   <tok>
    <orth>krokodyl</orth>
    <lex disamb="1"><base>krokodyl</base><ctag>subst:sg:nom:m2</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">12879</prop>
    <prop key="sense:ukb:syns_rank">12879/40.5162836207 254796/35.9915058408 7063215/33.3657479890 7063214/26.6770712118 7063217/25.5775738130 7063213/23.6851347572 7063212/23.6300037076</prop>
    <prop key="sense:ukb:unitsstr">krokodyl.1(21:zw) krokodyl_właściwy.1(21:zw)</prop>
   </tok>
   <tok>
    <orth>się</orth>
    <lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
   </tok>
   <tok>
    <orth>ja</orth>
    <lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex>
   </tok>

doc = xmlTreeParse("statsUCZESTxfreqkeyword xml.txt",useInternal = TRUE)
top = xmlRoot(doc)
xmlName(top)
names(top)
names( top[[ 1 ]] )
sent <- top[[ 1 ]] [[ "sentence" ]]
names(sent)
names(sent[[1]])
xmlSApply(sent[[1]], xmlValue)
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
nodes = getNodeSet(top, "//prop[@key='sense:ukb:unitsstr']")
lapply(nodes, function(x) xmlSApply(x, xmlValue))
# 152 words have prop
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))

1 answer

User

#1 · Posted 2024-06-06 10:19:47

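Here is a solution using the xml2 library. I find xml2's syntax simpler than that of the XML library. Each has its strengths and weaknesses.
The logic is similar to the answer I provided here: rvest: Return NAs for empty nodes given multiple listings. The comments in the code explain each step. In the code below, xmltext is either the XML text itself or the name of the file to process.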

library(xml2)

#read the xml document
page<-read_xml(xmltext)
#find all the tok nodes, one per token
listings<-xml_find_all(page, ".//tok")

#find the text associated with the orth nodes
orthotext<-sapply(listings, function(x){xml_text(xml_find_first(x, ".//orth"))})

#find text associated with the prop key="sense:ukb:unitsstr"
ukb<-sapply(listings, function(x){ nodes<-xml_find_all(x, ".//prop")
                            #find node with wanted key
                           wantednode<-nodes[xml_attr(nodes, "key" )=="sense:ukb:unitsstr"]
                           #extract text
                           wantednode<-xml_text(wantednode)
                           #return NA if no matching node was found, else the first match
                           if (length(wantednode) == 0) NA_character_ else wantednode[1]
})


#create dataframe
finalanswer<-data.frame(orthotext, ukb)
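
The same idea can probably be written more compactly by putting the key test into the XPath expression and relying on xml_find_first, which returns a missing node (whose text is NA) when a tok has no matching prop. The following is only a sketch under assumptions: the two-token sample XML is a shortened, hypothetical stand-in for the file in the question, and prop_text is an illustrative helper name, not part of xml2.

```r
library(xml2)

# Hypothetical two-token sample modeled on the question's XML:
# "go" carries props, "ja" has none.
xmltext <- '<chunkList><chunk id="ch1" type="p"><sentence id="s1">
  <tok><orth>go</orth><lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop></tok>
  <tok><orth>ja</orth><lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex></tok>
</sentence></chunk></chunkList>'

page <- read_xml(xmltext)
toks <- xml_find_all(page, ".//tok")

# Illustrative helper (name is an assumption): text of the first <prop>
# child with the given key for each tok, NA where no such prop exists.
prop_text <- function(nodes, key) {
  xml_text(xml_find_first(nodes, sprintf("./prop[@key='%s']", key)))
}

# One row per tok; missing props show up as NA
result <- data.frame(
  orth     = xml_text(xml_find_first(toks, "./orth")),
  unitsstr = prop_text(toks, "sense:ukb:unitsstr"),
  polarity = prop_text(toks, "polarity"),
  stringsAsFactors = FALSE
)
print(result)
```

Because xml_find_first is applied to the whole node set, it yields exactly one result per tok, so the columns stay aligned without an explicit sapply loop. Further keys can be added as extra columns by calling the helper again.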
