Subset rows by partially matching two columns on words longer than n characters

result <- structure(list(
  traffic_Count_Street = c("San Angelo", "W Commerce St", "W Commerce St",
                           "S Gevers St", "Austin Hwy", "W Evergreen St"),
  unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
                  "S New Braunfels", "Austin Highway", "W Cypress")),
  .Names = c("traffic_Count_Street", "unit_Street"),
  row.names = c(1L, 17L, 18L, 34L, 260L, 273L),
  class = "data.frame")

#     traffic_Count_Street     unit_Street
# 1             San Angelo   San Pedro Ave
# 17         W Commerce St      W Commerce
# 18         W Commerce St      W Commerce
# 34           S Gevers St S New Braunfels
# 260           Austin Hwy  Austin Highway
# 273       W Evergreen St       W Cypress

1 Answer

User

#1 · Posted 2024-05-16 15:05:06

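Build a distance filter with an adjustable threshold, then tune the threshold until you get the result you want. In this case, keeping rows whose Levenshtein distance is below 5 works well: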

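# Keep the rows whose first two columns are fewer than `thresh` edits apart.
distanceFilter <- function(df, thresh = 5) {
  # adist() gives the generalized Levenshtein distance between two strings
  ind <- apply(df, 1, function(x) adist(x[1], x[2]) < thresh)
  df[ind, ]
}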

distanceFilter(result, 5)
#     traffic_Count_Street    unit_Street
# 17         W Commerce St     W Commerce
# 18         W Commerce St     W Commerce
# 260           Austin Hwy Austin Highway
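
When tuning the threshold, it can help to look at the per-row distances before filtering. The following is a minimal sketch rather than part of the original answer: it rebuilds the question's data with default row names, and the `distance` column name is introduced here purely for illustration.

```r
# Rebuild the example data from the question (row names left at their defaults).
result <- data.frame(
  traffic_Count_Street = c("San Angelo", "W Commerce St", "W Commerce St",
                           "S Gevers St", "Austin Hwy", "W Evergreen St"),
  unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
                  "S New Braunfels", "Austin Highway", "W Cypress"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance: adist() returns a 1x1 matrix for each pair,
# so as.vector() flattens it and mapply() collects one number per row.
d <- mapply(function(a, b) as.vector(adist(a, b)),
            result$traffic_Count_Street, result$unit_Street,
            USE.NAMES = FALSE)

# Show the distances next to the data to help choose a cutoff.
with_distance <- cbind(result, distance = d)
print(with_distance)

# Same filter as distanceFilter(result, 5), using the precomputed distances.
filtered <- result[d < 5, ]
print(filtered)
```

Sorting `with_distance` by `distance` makes the gap between near matches and unrelated street names easier to see, which is one way to choose `thresh` instead of guessing.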

For more information, see the wiki page on Levenshtein distance and the R help page for adist (?adist).
