Splitting a string into individual words in Python

数据工程师

Asked 2025-04-16 22:39

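I have a fairly large list of domain names, about six thousand, and I'd like to see which words occur most often, to get a rough overview of our portfolio.

The problem is that the list is formatted as domain names, for example: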

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

(+5996 more)

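A plain word count on this list gives useless results. So I figure the simplest approach is to insert spaces between the whole words and then run the word count.

To make life easier for myself, I'd like to script this.

I know very little Python 2.7, but I'm open to any suggestions, and code examples would really help. I've been told that a simple string trie data structure is the easiest way to do this, but I don't know how to implement one in Python.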

Tags: data structures, string processing, scripting, frequency analysis, text analysis, word frequency, trie, domain analysis

3 Answers

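with open('/usr/share/dict/words') as f:
  # A set gives constant-time membership tests; a list is scanned on every lookup
  words = set(w.strip() for w in f)

def guess_split(word):
  # Try every cut point and keep the last one where both halves are words
  result = []
  for n in range(1, len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result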


from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f:
    for word in line.strip().split('.'):
      if len(word) > 3:
        # skips short labels such as com, org, co, uk
        for x in guess_split(word):
          word_counts[x] += 1

for word, count in word_counts.items():
  print('{word}: {count}'.format(word=word, count=count))

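This is a brute-force method that only tries to split each domain into two English words. If a domain does not split into two English words, it is simply dropped. Extending it to attempt more splits should be straightforward, but it probably won't scale well with the number of splits unless you are clever about it. Fortunately, I'd guess you will need three or four splits at most.

Output: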

deals: 1
example: 2
pensions: 1
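The extension to more than two pieces mentioned above could be sketched roughly as follows. This is only an illustration, not part of the answer's code: `guess_splits`, its `max_parts` parameter, and the tiny `demo_words` set are all hypothetical names introduced here, and a real run would use the full word list instead.

```python
def guess_splits(word, words, max_parts=4):
    """Return every way to cut word into 1..max_parts dictionary words."""
    if max_parts < 1:
        return []
    results = []
    if word in words:
        results.append([word])
    if max_parts > 1:
        for n in range(1, len(word)):
            head = word[:n]
            if head in words:
                # Only recurse when the head is a word, which prunes most cuts
                for tail in guess_splits(word[n:], words, max_parts - 1):
                    results.append([head] + tail)
    return results

# Tiny stand-in dictionary, for illustration only
demo_words = {"example", "car", "trading", "summer", "offers"}
splits = guess_splits("examplecartrading", demo_words)
print(splits)
```

With `max_parts=2` the same call returns an empty list, which matches the two-word version dropping that domain.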

Answered 2025-04-16 by Python大师

Assuming you only have a few thousand ordinary domains, you should be able to do all of this in memory.

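from collections import Counter

domainfile = 'domains.txt'  # placeholder: your file of domains, one per line

def all_sub_strings(s):
    # Every contiguous substring of s (helper left undefined in the original)
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

# Word list path is an assumption; strip() drops the newline so lookups match
with open('/usr/share/dict/words') as f:
    dictionary = set(line.strip().lower() for line in f)

found = []
with open(domainfile) as domains:
    for domain in domains:
        for substring in all_sub_strings(domain.strip().lower()):
            if substring in dictionary:
                found.append(substring)

c = Counter(found)  # this is what you want
print(c)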
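To read the most frequent words off the counter, `Counter.most_common` can be used. The sketch below uses a hand-made `found` list as a stand-in for the one built from real domains, so the numbers are illustrative only.

```python
from collections import Counter

# Hand-made stand-in for the list of dictionary words found in the domains
found = ["example", "deals", "example", "pensions", "example", "deals"]
c = Counter(found)

# most_common(n) returns the n highest counts as (word, count) pairs
top = c.most_common(2)
for word, count in top:
    print("{0}: {1}".format(word, count))
```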

Answered 2025-04-16 by Python大师

We try to split the domain name (s) into any number of words (not just two) taken from a set of known words (words). Recursion ftw!

def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

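This iterator function first yields the string it was called with, if that string is in words. It then splits the string in two in every possible way. If the first part is not in words, it moves on to the next split. If it is, the first part is prepended to every result of calling the function on the second part (there may be no results, as with ["example", "cart", ...]).

Next we build the English dictionary: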
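A small run makes the behaviour concrete. The generator is repeated here so the example runs on its own, and `demo_words` is a tiny made-up set rather than a real word list.

```python
def substrings_in_set(s, words):
    # Same generator as in the answer, repeated so this example is self-contained
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

# Tiny stand-in dictionary, for illustration only
demo_words = {"example", "pens", "ions", "pensions"}

splits = list(substrings_in_set("examplepensions", demo_words))
print(splits)

# "cart" cannot be covered by demo_words, so no complete split exists
no_splits = list(substrings_in_set("examplecart", demo_words))
print(no_splits)
```

Only splits that cover the whole string are yielded, which is why the second call produces nothing even though "example" is a known word.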

# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above English dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which
# means that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

# The parenthesized form with a single argument works on Python 2.7 and 3
print(count)
print("No match found for: {0}".format(no_match))

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
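Since the goal is to see which words are most frequent, the dictionary can be ranked afterwards. A minimal sketch, with the counts copied from the result above; the tie-breaking rule (alphabetical) is an arbitrary choice made here:

```python
# Counts copied from the result shown above
count = {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1,
         'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

# Sort by descending count, then alphabetically so ties come out in a fixed order
ranked = sorted(count.items(), key=lambda item: (-item[1], item[0]))
for word, n in ranked[:3]:
    print("{0}: {1}".format(word, n))
```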

Using a set to hold the English dictionary makes membership checks fast. -= removes items from the set, and |= adds items to it.
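The two operators can be tried on a small set. The starting contents below are made up for the illustration:

```python
words = set(["a", "b", "i", "example", "ex"])

# -= removes every element of the right-hand set; elements not present are ignored
words -= set("bcdefghjklmnopqrstvwxyz")   # drops "b"; "a" and "i" are not listed
words -= set(("ex", "frobnicate"))        # "frobnicate" was never in the set

# |= adds every element of the right-hand set
words |= set(("4", "slartibartfast"))

print(sorted(words))
```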

Using the all function together with a generator expression improves efficiency, because all returns as soon as it hits the first False.
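The short-circuit can be made visible with a small sketch. `is_word` and `checked` are hypothetical helpers that exist only to record which lookups actually happen; they are not part of the code above.

```python
checked = []

def is_word(part, words):
    # Record each lookup so the short-circuit is visible afterwards
    checked.append(part)
    return part in words

demo_words = {"example", "deals"}
parts = ["example", "xyz", "deals"]

# all() stops at the first False, so "deals" is never looked up
ok = all(is_word(p, demo_words) for p in parts)
print("{0} {1}".format(ok, checked))
```

Passing a list comprehension instead of a generator expression would evaluate all three lookups before all even starts.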

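Some substrings can be read either as one whole word or as several words, such as "example" and "ex" + "ample". In some cases we can solve this by excluding unwanted words, like "ex" in the code above. In other cases, such as "pensions" and "pens" + "ions", it may be unavoidable. When that happens, we need to keep the other words in the string from being counted more than once (once with "pensions" and once with "pens" + "ions"). We do this by keeping the words found for each domain name in a set, since sets ignore duplicates, and only counting once all the words have been found.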
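The effect of the per-domain set can be seen on a single domain. In this sketch the two splits of "examplepensions" are written out by hand rather than generated:

```python
# Both ways of splitting "examplepensions", written out by hand
splits = [["example", "pensions"], ["example", "pens", "ions"]]

# Naive: count every word in every split, so "example" is counted twice
naive = {}
for split in splits:
    for word in split:
        naive[word] = naive.get(word, 0) + 1

# Per-domain set: each distinct word is counted once for this domain
found = set()
for split in splits:
    found |= set(split)
deduped = dict((word, 1) for word in found)

print("{0} {1}".format(naive["example"], deduped["example"]))
```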

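Edit: Restructured the code and added lots of comments. Forced strings to lower case to avoid misses caused by capitalization. Also added a list to keep track of domain names for which no combination of words matched.

Necromancy edit: Changed the substring function so that it scales better. The old version became extremely slow for domain names longer than 16 characters. Using just the four domain names above, I cut my own running time from 3.6 seconds to 0.2 seconds!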

Answered 2025-04-16 by Python大师
