Efficient way to create a term density matrix from a pandas DataFrame

Question

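I am trying to create a term density matrix from a pandas DataFrame so that I can rate the terms appearing in the DataFrame. I also want to keep the "spatial" aspect of my data (see the comment at the end of the post for an example of what I mean).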

I am new to pandas and NLTK, so I expect my problem to be solvable with some existing tools.

My DataFrame contains two columns of interest: say "title" and "page".

    import pandas as pd
    import re

    df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ','Split orange','Something else'], 'page':[1, 2, 3, 4]})
    df.head()

       page                 title
    0     1  Delicious boiled egg
    1     2            Fried egg 
    2     3          Split orange
    3     4        Something else

My goal is to clean up the text and pass the terms of interest to a TDM DataFrame. I use two functions to help me clean up the strings.

    import string

    import nltk
    from nltk.corpus import stopwords

    def remove_punct(strin):
        '''
        returns a string with the punctuation marks removed, and all lower case letters
        input: strin, a str; non-ascii characters can be dropped first with
        strin.encode('ascii','ignore').decode('ascii')
        '''
        return strin.translate(str.maketrans("", "", string.punctuation)).lower()

    sw = stopwords.words('english')

    def tok_cln(strin):
        '''
        tokenizes string and removes stopwords
        '''
        return set(nltk.wordpunct_tokenize(strin)).difference(sw)

And one function that does the DataFrame manipulation.

    def df2tdm(df,titleColumn,placementColumn,newPlacementColumn):
        '''
        takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in the titleColumn
        Inputs: df, a DataFrame containing titleColumn, placementColumn among others
        Outputs: tdm_df, a DataFrame containing newPlacementColumn and columns with all the terms in df[titleColumn]
        '''
        tdm_df = pd.DataFrame(index=df.index, columns=[newPlacementColumn])
        tdm_df = tdm_df.fillna(0)
        for idx in df.index:
            # drop non-ascii characters before cleaning the title
            title = df.loc[idx, titleColumn].encode('ascii','ignore').decode('ascii')
            for word in tok_cln(remove_punct(title)):
                if word not in tdm_df.columns:
                    # first occurrence of this term: add an empty column for it
                    newcol = pd.DataFrame(index = df.index, columns = [word])
                    tdm_df = tdm_df.join(newcol)
                tdm_df.loc[idx, newPlacementColumn] = df.loc[idx, placementColumn]
                tdm_df.loc[idx, word] = 1
        return tdm_df.fillna(0)

    tdm_df = df2tdm(df,'title','page','pub_page')
    tdm_df.head()

This results in:

      pub_page boiled egg delicious fried orange split something else
    0        1      1   1         1     0      0     0         0    0
    1        2      0   1         0     1      0     0         0    0
    2        3      0   0         0     0      1     1         0    0
    3        4      0   0         0     0      0     0         1    1

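But it is painfully slow on large datasets (output of a hundred thousand rows and thousands of columns). My two questions:

Can I speed up this implementation?

Is there some other tool I could use to get this done?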
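For reference, here is a minimal sketch of one way the same 0/1 matrix might be built without the per-row loop and the repeated `join` calls, using only pandas string methods and `pd.crosstab`. It assumes Python 3 and a reasonably recent pandas (one that has `Series.explode`). The function name `titles_to_tdm` and the tiny `STOPWORDS` set are hypothetical stand-ins for illustration, not part of the code above, and whitespace splitting replaces `wordpunct_tokenize`.

```python
import string

import pandas as pd

df = pd.DataFrame({
    'title': ['Delicious boiled egg', 'Fried egg ', 'Split orange', 'Something else'],
    'page': [1, 2, 3, 4],
})

# Hypothetical stand-in for NLTK's English stopword list, kept tiny on purpose.
STOPWORDS = {'a', 'an', 'and', 'the', 'of'}


def titles_to_tdm(frame, title_col, placement_col, new_placement_col):
    """Build a 0/1 term matrix without a Python-level loop over rows."""
    # Lower-case and strip punctuation for the whole column at once.
    strip_punct = str.maketrans('', '', string.punctuation)
    cleaned = frame[title_col].str.lower().str.translate(strip_punct)

    # One entry per (original row, term) pair; titles with no terms drop out here.
    terms = cleaned.str.split().explode().dropna()
    terms = terms[~terms.isin(STOPWORDS)]
    pairs = pd.DataFrame({'row': terms.index, 'term': terms.to_numpy()})

    # Count the pairs, cap at 1 for presence/absence, and restore rows with no terms.
    tdm = pd.crosstab(pairs['row'], pairs['term']).clip(upper=1)
    tdm = tdm.reindex(frame.index, fill_value=0)
    tdm.columns.name = None
    tdm.insert(0, new_placement_col, frame[placement_col])
    return tdm


tdm_df = titles_to_tdm(df, 'title', 'page', 'pub_page')
print(tdm_df)
```

This still produces a dense matrix, so at a hundred thousand rows by thousands of columns memory may become the next bottleneck; a sparse representation would likely be needed at that scale.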

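I want to be able to keep the "spatial" aspect of my data: for example, if "egg" appears very often on pages 1-10 and then reappears often on pages 500-520, I want to know that.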
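To make the "spatial" requirement concrete, here is a rough sketch of the kind of summary I have in mind: term counts summed over consecutive page ranges. The toy matrix, the `BIN_WIDTH` of 10 pages, and the helper name `term_counts_by_page_bin` are all made up for illustration; only the column layout (a placement column followed by one 0/1 column per term) matches the output shown earlier.

```python
import pandas as pd

# Toy 0/1 term matrix in the same layout as the TDM output; the values are made up.
tdm_df = pd.DataFrame({
    'pub_page': [1, 3, 8, 250, 501, 505, 509],
    'egg':      [1, 1, 1, 0,   1,   1,   1],
    'orange':   [0, 0, 1, 1,   0,   0,   0],
})

BIN_WIDTH = 10  # hypothetical bucket size, in pages


def term_counts_by_page_bin(tdm, placement_col, bin_width):
    """Sum term occurrences inside consecutive page ranges of bin_width pages."""
    # Pages 1..10 fall in the bucket starting at 1, pages 11..20 at 11, and so on.
    bin_start = (tdm[placement_col] - 1) // bin_width * bin_width + 1
    counts = tdm.drop(columns=placement_col).groupby(bin_start).sum()
    counts.index.name = 'first_page'
    return counts


binned = term_counts_by_page_bin(tdm_df, 'pub_page', BIN_WIDTH)
print(binned)

# Buckets where "egg" shows up at least three times.
print(binned.index[binned['egg'] >= 3].tolist())
```

With this toy data, "egg" clusters in the buckets starting at pages 1 and 501, which is the sort of pattern I would like to be able to read off the real matrix.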
