Finding the support of each word in a text dataset using Python

Posted 2024-04-25 00:21:12


In Python, how can I find the count of each distinct word in this dataset: https://drive.google.com/open?id=1ADdzZp31SwiF70IZ13hbAtPNHBv5NmOY

I have loaded the dataset with:

# Load the data
fin = open("b.txt", 'r')
translist = []
for line in fin:
    trans = line.strip().split(' ')
    translist.append(trans)  

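I need the support of each element in order to perform contiguous pattern mining. For example, suppose the phrase "parking lot" has an absolute support of 133; then the line for this frequent contiguous sequential pattern in "b.txt" should be: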

133:parking lot
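
To make the notion concrete, here is a minimal sketch of what such a count could look like for a single phrase. It assumes that absolute support means the number of lines that contain the phrase at least once as a contiguous word sequence; the sample lines and the function name `absolute_support` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines; the real data would come from b.txt.
lines = [
    "the parking lot was full",
    "no parking lot nearby",
    "great food and service",
]

def absolute_support(phrase, lines):
    """Count the lines that contain `phrase` as a contiguous word sequence.

    Assumption: support is counted once per line, not once per occurrence.
    """
    target = phrase.split()
    n = len(target)
    count = 0
    for line in lines:
        words = line.split()
        # Slide a window of n words across the line and look for a match.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            count += 1
    return count

support = absolute_support("parking lot", lines)
print(f"{support}:parking lot")
```

If support should instead count every occurrence, the `if any(...)` test would be replaced by a sum over the matching windows.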


1 answer
User
Answer #1 · posted 2024-04-25 00:21:12

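This seems to work. The maximum phrase length sampled for the dictionary is the variable p_length (I set it to 3), the phrase length sampled for the ranked list is p_size (I set it to 3; the smaller it is, of course, the higher the top frequencies), and the number of phrases in the final ranked list is the variable rank (I set it to 25). These are set on lines 8-10. The length of the ranked list that it prints (see the end of "def top_list():") is the number of distinct phrases of up to p_length words.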

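# Load the data as one flat list of words
translist = []
with open("b.txt", 'r') as fin:  # the file is closed automatically
    for line in fin:
        trans = line.split()  # split on any whitespace; blank lines add nothing
        translist.extend(trans)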

p_length = 3
p_size = 3
rank = 25

#Use a dictionary to create a histogram1 of the frequencies of the phrases (but this list is not in order)
def histogram1(translist,p_length):
    global dict1
    dict1 = dict()
    phraseList = []
    for transIndex in range(len(translist)):
        for i in range(p_length):
            if (transIndex+1+i) <= len(translist):
                phraseElementNow = translist[transIndex+i]
            else:
                continue
            if i > 0:
                joinables = (newElement, phraseElementNow)
                newElement = ' '.join(joinables)
            else:
                newElement = phraseElementNow
            phraseList.append(newElement)
    for element2 in phraseList:
        if element2 not in dict1:
            dict1[element2] = 1
        else:
            dict1[element2] += 1
    return dict1

#Create the ranked list of phrases vs their frequency.
def top_list():
    global topList
    topList = []
    for key, value in dict1.items():
        topList.append((value, key))
    topList.sort(reverse = True)
    print("Length of ranking list is: ") #Just a check
    print(len(topList))
    #print(topList[-(rank):])   Used this to check format of ranking list

#Choose the top x ranking to print (I made it 25 on line 10).
def short_list(p_size, rank):
    topTopList = []
    print("The "+str(rank)+" most common phrases "+str(p_size)+" words long are: ")
    for phrase in topList:
        phraseParts = phrase[1].split(' ')
        if len(phraseParts) == p_size:
            topTopList.append(phrase)
        else:
            continue
    for freq, word in topTopList[:rank]:
        wordParts = word.split(' ')
        wordForPrint = ';'.join(wordParts)
        completePrint = str(freq)+':'+wordForPrint
        print(completePrint)

print(histogram1(translist, p_length))
top_list()
short_list(p_size, rank)
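
For comparison, the same idea can be sketched more compactly with `collections.Counter`. This is an alternative under stated assumptions, not a drop-in replacement: it counts phrases line by line, so a phrase never spans two lines (the code above flattens the file first, so it does count such phrases), and it counts every occurrence rather than once per line. The function names `count_phrases` and `top_phrases` and the sample data are hypothetical.

```python
from collections import Counter

def count_phrases(lines, max_len=3):
    """Count every contiguous phrase of 1..max_len words, line by line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for n in range(1, max_len + 1):
            # Each window of n consecutive words is one phrase occurrence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def top_phrases(counts, size, rank):
    """Return up to `rank` (phrase, frequency) pairs for phrases of `size` words."""
    sized = Counter({p: c for p, c in counts.items() if len(p) == size})
    return sized.most_common(rank)

# Hypothetical sample; pass an open file object for b.txt to use real data.
sample = ["a b c", "a b", "b c a b"]
counts = count_phrases(sample, max_len=3)
for phrase, freq in top_phrases(counts, size=2, rank=25):
    # Same output format as short_list: frequency, colon, words joined by ';'
    print(f"{freq}:{';'.join(phrase)}")
```

Storing phrases as tuples of words avoids re-splitting strings later to find their length, which is what `short_list` has to do.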
