spaCy's most_similar() function returns an error on GPU

Posted 2024-04-26 05:45:50


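I am trying to evaluate the performance of spaCy's most_similar method (https://spacy.io/api/vectors#most_similar), and I am curious whether it runs faster on a GPU. The function looks like this: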

def spacy_most_similar(word, topn=10):
  ms = nlp_ru.vocab.vectors.most_similar(nlp_ru(word).vector.reshape(1,100), n=topn)
  words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
  distances = ms[2]
  return words, distances
spacy_most_similar("дерево", 10)

This works on the CPU version, but on the GPU (where the vectors are CuPy arrays instead of NumPy arrays) I get an error:

    TypeError                                 Traceback (most recent call last)
<ipython-input-8-ea5e049ec55b> in <module>()
      7   distances = ms[2]
      8   return words, distances
----> 9 spacy_most_similar("дерево", 10)

<ipython-input-8-ea5e049ec55b> in spacy_most_similar(word, topn)
      3   print(nlp_ru(word).vector.reshape(1,100).shape)
      4   ms = nlp_ru.vocab.vectors.most_similar(
----> 5       nlp_ru(word).vector.reshape(1,100), n=topn)
      6   words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
      7   distances = ms[2]

vectors.pyx in spacy.vectors.Vectors.most_similar()

TypeError: list indices must be integers or slices, not cupy.core.core.ndarray

I also tried this approach:

def spacy_most_similar(word, topn=10):
  ms = nlp_ru.vocab.vectors.most_similar(np.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
  words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
  distances = ms[2]
  return words, distances
spacy_most_similar("дерево", 10)

Again, this works fine on the CPU, but with the GPU version (where I changed np to cp):

import cupy as cp
def spacy_most_similar(word, topn=10):
  with cp.cuda.Device(0):
    nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
  ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
  words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
  distances = ms[2]
  return words, distances
spacy_most_similar("дерево", 10)

I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-876656d5f75d> in <module>()
      7   distances = ms[2]
      8   return words, distances
----> 9 spacy_most_similar("дерево", 10)

<ipython-input-6-876656d5f75d> in spacy_most_similar(word, topn)
      3   with cp.cuda.Device(0):
      4     nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
----> 5   ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
      6   words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
      7   distances = ms[2]

vectors.pyx in spacy.vectors.Vectors.most_similar()

TypeError: unhashable type: 'cupy.core.core.ndarray'

Can you help me build the correct CuPy input for the most_similar() method?


1 answer
Answer 1 · posted 2024-04-26 05:45:50

Judging by the current source code, I doubt that you can run most_similar on the GPU:

def most_similar(self, queries, *, batch_size=1024, n=1, sort=True):
    """For each of the given vectors, find the n most similar entries
    to it, by cosine.
    Queries are by vector. Results are returned as a `(keys, best_rows,
    scores)` tuple. If `queries` is large, the calculations are performed in
    chunks, to avoid consuming too much memory. You can set the `batch_size`
    to control the size/space trade-off during the calculations.
    queries (ndarray): An array with one or more vectors.
    batch_size (int): The batch size to use.
    n (int): The number of entries to return for each query.
    sort (bool): Whether to sort the n entries returned by score.
    RETURNS (tuple): The most similar entries as a `(keys, best_rows, scores)`
        tuple.
    """
    filled = sorted(list({row for row in self.key2row.values()}))
    if len(filled) < n:
        raise ValueError(Errors.E198.format(n=n, n_rows=len(filled)))
    xp = get_array_module(self.data)

    norms = xp.linalg.norm(self.data[filled], axis=1, keepdims=True)
    norms[norms == 0] = 1
    vectors = self.data[filled] / norms

    best_rows = xp.zeros((queries.shape[0], n), dtype='i')
    scores = xp.zeros((queries.shape[0], n), dtype='f')
    # Work in batches, to avoid memory problems.
    for i in range(0, queries.shape[0], batch_size):
        batch = queries[i : i+batch_size]
        batch_norms = xp.linalg.norm(batch, axis=1, keepdims=True)
        batch_norms[batch_norms == 0] = 1
        batch /= batch_norms
        # batch   e.g. (1024, 300)
        # vectors e.g. (10000, 300)
        # sims    e.g. (1024, 10000)
        sims = xp.dot(batch, vectors.T)
        best_rows[i:i+batch_size] = xp.argpartition(sims, -n, axis=1)[:,-n:]
        scores[i:i+batch_size] = xp.partition(sims, -n, axis=1)[:,-n:]

        if sort and n >= 2:
            sorted_index = xp.arange(scores.shape[0])[:,None][i:i+batch_size],xp.argsort(scores[i:i+batch_size], axis=1)[:,::-1]
            scores[i:i+batch_size] = scores[sorted_index]
            best_rows[i:i+batch_size] = best_rows[sorted_index]

    for i, j in numpy.ndindex(best_rows.shape):
        best_rows[i, j] = filled[best_rows[i,j]]
    # Round values really close to 1 or -1
    scores = xp.around(scores, decimals=4, out=scores)
    # Account for numerical error we want to return in range -1, 1
    scores = xp.clip(scores, a_min=-1, a_max=1, out=scores)
    row2key = {row: key for key, row in self.key2row.items()}
    keys = xp.asarray(
        [[row2key[row] for row in best_rows[i] if row in row2key] 
                for i in range(len(queries)) ], dtype="uint64")
    return (keys, best_rows, scores)

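Note that filled is already a CPU object (a plain Python list). It can be indexed correctly with indices taken from a NumPy array, but not with indices taken from a CuPy array. The error TypeError: list indices must be integers or slices, not cupy.core.core.ndarray comes from these two lines: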

for i, j in numpy.ndindex(best_rows.shape):
    best_rows[i, j] = filled[best_rows[i, j]]
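
One possible way around the element-by-element list lookup is to turn filled into an array of the same module as best_rows and use integer-array indexing. The sketch below is only an illustration: map_to_filled_rows is a hypothetical helper, not part of spaCy, and the demo uses NumPy so that it runs without a GPU. Passing xp=cupy should behave the same way, assuming CuPy's integer-array indexing matches NumPy's here.

```python
import numpy as np


def map_to_filled_rows(best_rows, filled, xp=np):
    """Map positions within `filled` back to real row numbers.

    Hypothetical helper (not a spaCy function). `xp` is the array module,
    NumPy by default; CuPy is assumed to accept the same calls.
    """
    # Convert the Python list once, so the lookup happens inside the array module.
    filled_arr = xp.asarray(filled, dtype=best_rows.dtype)
    # Integer-array indexing replaces the per-element loop over a list.
    return filled_arr[best_rows]


filled = [0, 2, 5, 7]                             # rows that actually hold a vector
best_rows = np.array([[3, 1], [0, 2]], dtype="i")  # positions inside `filled`
mapped = map_to_filled_rows(best_rows, filled)
print(mapped)
```

Because the lookup stays inside the array module, no CuPy element is ever used as a Python list index.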

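If you think that finding the most similar words on the GPU would be valuable, you can open an issue at https://github.com/explosion/spaCy/issues, or write your own most_similar (I think that is fairly simple).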
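
As a rough idea of what such a hand-written version could look like, here is a minimal sketch of a cosine top-n search that takes the array module as a parameter. The name cosine_top_n and its signature are assumptions made for this illustration, not spaCy's API. It returns row indices and scores only, and it skips the batching that the spaCy implementation does. The demo runs on NumPy; on a GPU you would pass xp=cupy together with CuPy arrays, assuming CuPy provides the same functions used here.

```python
import numpy as np


def cosine_top_n(table, queries, n=1, xp=np):
    """Return (rows, scores) of the n rows of `table` closest to each query.

    Illustrative sketch, not spaCy code. `xp` is the array module
    (NumPy here; CuPy is assumed to offer the same calls).
    """
    table = xp.asarray(table, dtype="float32")
    queries = xp.asarray(queries, dtype="float32")

    def normalize(m):
        # Scale every row to unit length; all-zero rows are left unchanged.
        norms = xp.linalg.norm(m, axis=1, keepdims=True)
        norms[norms == 0] = 1
        return m / norms

    # Cosine similarity of every query against every table row.
    sims = xp.dot(normalize(queries), normalize(table).T)
    # Sorting the negated similarities gives descending order; keep the first n.
    rows = xp.argsort(-sims, axis=1)[:, :n]
    scores = xp.take_along_axis(sims, rows, axis=1)
    return rows, scores


table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rows, scores = cosine_top_n(table, [[1.0, 0.1]], n=2)
print(rows, scores)
```

With spaCy, table would be nlp_ru.vocab.vectors.data. Mapping the returned rows back to words goes through key2row and vocab.strings, which are ordinary Python lookups, so on the GPU the rows would first have to be copied to the host (for example with cupy.asnumpy). The unhashable type error in the question probably arises for the same reason: a CuPy array element is used as a dictionary key.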
