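I want to merge two tables in R or Python, each with tens of thousands of rows. However, I can't merge on exact matches. I'm looking for cases where one key is a substring of the other, and the matching substring may contain more than one word. I'm looking for a solution that is faster than the brute-force code below.
Brandon Bertelsen (https://stackoverflow.com/users/170352/brandon-bertelsen) gave a great answer based on the toy data I originally suggested. However, it only matches single-word substrings. (I didn't state the multi-word requirement in my original question.)
Here is the code I'm using for the real data.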
library(SPARQL)
library(parallel)
library(Hmisc)
library(tidyr)
library(dplyr)
my.endpoint <- "http://sparql.hegroup.org/sparql/"
go.query <- 'select *
where { graph <http://purl.obolibrary.org/obo/merged/GO>
{ ?goid
<http://www.geneontology.org/formats/oboInOwl#hasOBONamespace>
"biological_process"^^<http://www.w3.org/2001/XMLSchema#string> .
?goid rdfs:label ?goterm}}'
go.result <- SPARQL(url = my.endpoint, query = go.query)
go.result.frame <- go.result[[1]]
anat.query <- 'select distinct ?anatterm ?anatid
where { graph <http://purl.obolibrary.org/obo/merged/UBERON>
{ ?anatid <http://www.geneontology.org/formats/oboInOwl#hasDbXref> ?xr .
?anatid rdfs:label ?anatterm}}'
anat.result <- SPARQL(url = my.endpoint, query = anat.query)
anat.result.frame <- anat.result[[1]]
# slow but recognizes multi-word substrings
loop.solution <-
  mclapply(
    X = sort(anat.result.frame$anatid),
    mc.cores = 7,
    FUN = function(one.anat.id) {
      one.anat.term <-
        anat.result.frame$anatterm[anat.result.frame$anatid == one.anat.id]
      temp <-
        grepl(pattern = paste0('\\b', one.anat.term, '\\b'),
              x = go.result.frame$goterm)
      temp <- go.result.frame[temp , ]
      if (nrow(temp) > 0) {
        temp$anatterm <- one.anat.term
        temp$anatid <- one.anat.id
        return(temp)
      }
    }
  )
loop.solution <- do.call(rbind, loop.solution)
# from Brandon
# fast, but doesn't recognize multi-word matches
sep.gather.soln <-
  separate(go.result.frame,
           goterm,
           letters,
           sep = " ",
           remove = FALSE) %>%
  gather(goid, goterm) %>%
  na.omit() %>%
  setNames(c("goid", "goterm", "code", "anatterm")) %>%
  select(goid, goterm, anatterm) %>%
  left_join(anat.result.frame) %>%
  na.omit()
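One possible way to keep the speed of a join while still catching multi-word matches is to index the short keys by their word sequence and then look up every contiguous run of words (word n-gram) of each long key in that index. The Python sketch below is only an illustration under some assumptions: the function name `ngram_substring_join`, the sample rows, and their IDs are made up, and words are delimited by whitespace, which is stricter than the `\\b` word boundary used in the loop above (for example, a term followed by a hyphen or comma would not match here).

```python
from collections import defaultdict


def ngram_substring_join(long_rows, short_rows):
    """Pair rows where a short key occurs as a whole-word phrase in a long key.

    long_rows:  iterable of (long_id, long_text), e.g. (goid, goterm)
    short_rows: iterable of (short_id, short_text), e.g. (anatid, anatterm)
    Matching is case-sensitive and splits words on whitespace only.
    """
    # Index every short key by its tuple of words for constant-time lookup.
    index = defaultdict(list)
    for short_id, short_text in short_rows:
        words = tuple(short_text.split())
        if words:
            index[words].append((short_id, short_text))
    if not index:
        return []

    # No n-gram longer than the longest short key can ever match.
    longest = max(len(words) for words in index)

    matches = []
    for long_id, long_text in long_rows:
        words = long_text.split()
        seen = set()  # avoid duplicate rows when a phrase repeats in one text
        for n in range(1, min(longest, len(words)) + 1):
            for start in range(len(words) - n + 1):
                gram = tuple(words[start:start + n])
                if gram in index and gram not in seen:
                    seen.add(gram)
                    for short_id, short_text in index[gram]:
                        matches.append((long_id, long_text, short_id, short_text))
    return matches


# Made-up sample rows; the IDs are placeholders, not real ontology identifiers.
go_rows = [
    ("go-a", "heart development"),
    ("go-b", "spinal cord development"),
    ("go-c", "cell cycle"),
]
anat_rows = [
    ("anat-1", "heart"),
    ("anat-2", "spinal cord"),
    ("anat-3", "cord"),
]

for row in ngram_substring_join(go_rows, anat_rows):
    print(row)
```

On these sample rows the sketch pairs "heart development" with "heart", and "spinal cord development" with both "cord" and "spinal cord". The work per long key grows with its word count times the longest short key, instead of one regular-expression scan of the whole table per short key.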
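I'm using the data from your original post.
First, split the terms.
Then check the split terms against the associated terms in the dictionary.
All three steps in one: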
   mealtime            dish ingredient category
1 breakfast cheese omelette     cheese    dairy
2     lunch turkey sandwich     turkey     meat
3    dinner       bean soup       bean   legume
5 breakfast cheese omelette   omelette     eggs
6     lunch turkey sandwich   sandwich    bread
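For readers who prefer Python, the split-then-look-up steps can be sketched roughly as follows. This is an illustration and not the answer's own code: the toy data is reconstructed from the output above (the original tables may have held more rows), and `split_and_lookup` is a hypothetical helper name.

```python
# Toy data reconstructed from the output shown above (an assumption).
meals = [
    ("breakfast", "cheese omelette"),
    ("lunch", "turkey sandwich"),
    ("dinner", "bean soup"),
]
categories = {
    "cheese": "dairy",
    "turkey": "meat",
    "bean": "legume",
    "omelette": "eggs",
    "sandwich": "bread",
}


def split_and_lookup(meals, categories):
    """Split each dish into single words and keep the words found in the dictionary."""
    rows = []
    for mealtime, dish in meals:
        for ingredient in dish.split():
            category = categories.get(ingredient)
            if category is not None:  # drop words with no dictionary entry
                rows.append((mealtime, dish, ingredient, category))
    return rows


result = split_and_lookup(meals, categories)
for row in result:
    print(row)
```

Because each dish is split into single words before the lookup, a dictionary key made of several words could never match, which is the single-word limitation described in the question.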