How to make a basic inverted index program more Pythonic

Asked 2025-04-18 11:48

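I have some code for an inverted index, shown below. I'm not very happy with it, though, and would like to know how to make it more concise and more Pythonic.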

class invertedIndex(object):


  def __init__(self,docs):
     self.docs,self.termList,self.docLists=docs,[],[]

     for index,doc in enumerate(docs):

        for term in doc.split(" "):
            if term in self.termList:
                i=self.termList.index(term)
                if index not in self.docLists[i]:
                    self.docLists[i].append(index)

            else:
                self.termList.append(term)
                self.docLists.append([index])  

  def search(self,term):
        try:
            i=self.termList.index(term)
            return self.docLists[i]
        except:
            return "No results"





docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print i.search("sales")

Tags: code optimization, coding style, inverted index

2 Answers

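This solution is almost identical to @Peter Gibson's. In this version the index itself is the data: the class subclasses dict, so there is no separate docSets object. That makes the code slightly shorter and clearer.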

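This code also keeps the documents in their original order... which is something of a bug, and I prefer Peter's set() implementation. Because each entry is a plain list, a term repeated within one document (such as "june" in the first document) gets that document's index appended once per occurrence.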
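If the ordered list is wanted but the repeated entries are not, one possible approach is to skip an index that was just appended. This is only a sketch: `build_list_index` is a hypothetical helper, not part of either answer, and it assumes documents are processed in increasing index order.

```python
def build_list_index(docs):
    """Hypothetical helper: map each term to an ordered list of document indices."""
    index = {}
    for doc_index, doc in enumerate(docs):
        for term in doc.split():
            postings = index.setdefault(term, [])
            # Indices arrive in increasing order, so comparing against the
            # last entry is enough to skip repeats within one document.
            if not postings or postings[-1] != doc_index:
                postings.append(doc_index)
    return index


docs = ["new home sales top forecasts june june june",
        "home sales rise in july june"]

index = build_list_index(docs)
print(index["june"])  # [0, 1]
```

A set does the same deduplication with less code, at the cost of losing the ordering.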

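Also note that looking up a term that does not exist, such as ix['garbage'], implicitly modifies the index, because __missing__ stores an empty list under that key. That is fine if search is the only interface, but it is worth being aware of.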
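If that side effect is unwanted, one option is to drop __missing__ and build the index with dict.setdefault. The class name `ReadOnlyLookupIndex` below is hypothetical and this is only a sketch of that idea, not the code from this answer.

```python
class ReadOnlyLookupIndex(dict):
    """Hypothetical variant: no __missing__, so lookups never add keys."""

    def __init__(self, docs):
        super(ReadOnlyLookupIndex, self).__init__()
        self.docs = docs
        for doc_index, doc in enumerate(docs):
            for term in doc.split():
                # setdefault creates the list only while building the index.
                self.setdefault(term, []).append(doc_index)

    def search(self, term):
        # dict.get never inserts a key, so searching leaves the index alone.
        return self.get(term) or 'No results'


docs = ["home sales rise in july",
        "july new home sales rise"]

ix = ReadOnlyLookupIndex(docs)
print(ix.search("sales"))    # [0, 1]
print(ix.search("garbage"))  # No results
print("garbage" in ix)       # False
```

With this variant, ix['garbage'] raises KeyError instead of quietly creating an empty entry.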

Source

class InvertedIndex(dict):
    def __init__(self, docs):
        self.docs = docs

        for doc_index,doc in enumerate(docs):
            for term in doc.split(" "):
                self[term].append(doc_index)

    def __missing__(self, term):
        # operate like defaultdict(list)
        self[term] = []
        return self[term]

    def search(self, term):
        return self.get(term) or 'No results'


docs=["new home sales top forecasts june june june",
      "home sales rise in july june",
      "increase in home sales in july",
      "july new home sales rise",
      'beer',
      ]

ix = InvertedIndex(docs)
print ix.__dict__
print
print 'sales:',ix.search("sales")
print 'whiskey:', ix.search('whiskey')
print 'beer:', ix.search('beer')

print '\nTEST OF KEY SETTING'
print ix['garbage']
print 'garbage' in ix
print ix.search('garbage')

Output

{'docs': ['new home sales top forecasts june june june', 'home sales rise in july june', 'increase in home sales in july', 'july new home sales rise', 'beer']}

sales: [0, 1, 2, 3]
whiskey: No results
beer: [4]

TEST OF KEY SETTING
[]
True
No results

Answered 2025-04-18 by Python大师


Store the document indices in a Python set, and use a dict to map each term to its "set of documents".

from collections import defaultdict

class invertedIndex(object):

  def __init__(self,docs):
      self.docSets = defaultdict(set)
      for index, doc in enumerate(docs):
          for term in doc.split():
              self.docSets[term].add(index)

  def search(self,term):
        return self.docSets[term]

docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print i.search("sales") # outputs: set([0, 1, 2, 3])

A set is a bit like a list, but it is unordered and cannot contain duplicate entries.
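A small sketch of what that means for the index: adding the same document index several times leaves a single entry, and sorted() can be used when a stable order is needed for display. The variable names here are illustrative only.

```python
postings = set()

# "june" appears three times in document 0 and once in document 1.
for doc_index in [0, 0, 0, 1]:
    postings.add(doc_index)

print(len(postings))     # 2
print(sorted(postings))  # [0, 1]
```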

A defaultdict is basically a dict that supplies a default value when a key has no data yet (in this case, an empty set).
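The behavior can be seen in isolation with a short sketch (the name `doc_sets` is illustrative). Note that plain indexing on a missing key also stores the new empty set, so a search written as `self.docSets[term]` adds an entry for every unknown term; .get() avoids that.

```python
from collections import defaultdict

doc_sets = defaultdict(set)

# The key is missing, so defaultdict calls set() first, then 0 is added.
doc_sets["sales"].add(0)
doc_sets["sales"].add(1)
print(sorted(doc_sets["sales"]))  # [0, 1]

# Indexing a missing key creates and stores an empty set.
print("rise" in doc_sets)  # False
hits = doc_sets["rise"]
print("rise" in doc_sets)  # True

# .get() returns a fallback without touching the dict.
missing = doc_sets.get("whiskey", set())
print("whiskey" in doc_sets)  # False
```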

Answered 2025-04-18 by Python大师
