How to parse a text file with pandas and create a list

Published 2024-06-17 15:27:38


I am trying to use pandas to create a list/array containing all the words in the "review/text" field of the following text file:

product/productId: B001E4KFG0 review/userId: A3SGXH7AUHU8GW review/profileName: delmartian review/helpfulness: 1/1 review/score:
5.0 review/time: 1303862400 review/summary: Good Quality Dog Food review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

product/productId: B00813GRG4 review/userId: A1D87F6ZCVE5NK review/profileName: dll pa review/helpfulness: 0/0 review/score: 1.0 review/time: 1346976000 review/summary: Not as Advertised review/text: Product arrived labeled as Jumbo Salted Peanuts...

(The text file foods.txt is available at: http://snap.stanford.edu/data/web-FineFoods.html)

My ultimate goal is to identify all the unique words that appear in the review/text field.

I wrote the following code:

    import pandas as pd
    
    f=open("foods.txt","r")
    df=pd.read_csv(f,names=['product/productId','review/userId','review/profileName','review/helpfulness','review/score','review/time','review/summary'])
    selected = df[ df['review/summary'] ] 
    print(selected)

    selected.to_csv('result.csv', sep=' ', header=False)

However, I get the following error:

ValueError: cannot index with vector containing NA / NaN values

Any suggestions or comments?


3 answers

I think you need to do the following to extract every record from the file and get the review/summary values. You don't need a DataFrame.

#create a dictionary to store the list of review summary values
d = {'review summary':[]}

#function to extract only the review_summary from the line
def split_review_summary(full_line):
    
    #find review/text and exclude it from the line
    found = full_line.find('review/text:')
    if found >= 0:
        full_line = full_line[:found]

    #find review summary. All text to the right is review summary
    #add this to the dictionary
    found = full_line.find('review/summary:')
    if found >= 0:
        review_summary = full_line[(found + 15):]
        d['review summary'].append(review_summary)

#open the file for reading
with open ("xyz.txt","r") as f:
    #read the first line
    new_line = f.readline().rstrip('\n')
    #loop through the rest of the lines
    for line in f:
        #remove newline from the data
        line = line.rstrip('\n')
        
        #if the line starts with product/productId, then it's a new entry
        #process the previous record and extract its review summary
        #to do that, call the split_review_summary function
        
        if line[:17] == 'product/productId':
            split_review_summary(new_line)
            #reset new_line to the current line
            new_line = line
        else:
            #append to new_line, as it's part of the previous record
            new_line += line

#the last full record has not been processed
#So send it to split_review_summary to extract review summary
split_review_summary(new_line)

#now dictionary d has all the review summary items
print (d)

The output will be:

{'review summary': [' Good Quality Dog Food ', ' Not as Advertised ']}

I assume the scope of your question also includes writing a new file.

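You can open a file and write the dictionary out as a single line; that line will contain all the details. I'll leave that part for you to work out.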
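
Writing the dictionary out as a single line could look like the sketch below. It uses the standard `json` module, which is one possible choice and not necessarily what this answer had in mind. The file name `summaries.json` and the sample contents of `d` are assumptions for illustration.

```python
import json

# Sample data standing in for the dictionary d built above (assumption)
d = {'review summary': [' Good Quality Dog Food ', ' Not as Advertised ']}

# Hypothetical output file name
out_path = 'summaries.json'

# json.dumps returns a single-line string when no indent is given
with open(out_path, 'w') as out_file:
    out_file.write(json.dumps(d) + '\n')

# Read the line back to confirm the round trip
with open(out_path) as in_file:
    restored = json.loads(in_file.readline())
```

Storing the data as JSON keeps the list structure intact, so it can be loaded again later without re-parsing foods.txt.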

I looked at the link S.Ghoshal provided and came up with the following:

#Open the file and read every line
with open('foods.txt') as your_file:
    reviews = your_file.readlines()

reviews_array = []
dictionary = {}

#Go through every line; a blank line marks the end of a review
for review in reviews:
    #Split on the first ":" only, so colons inside the review text are kept
    this_line = review.split(":", 1)
    if len(this_line) > 1:
        #The part before the first ":" is the key, the rest is the content
        dictionary[this_line[0]] = this_line[1].strip()
    elif not review.strip() and dictionary:
        #A blank line was found: save the current review and reset the
        #dictionary for the next review
        reviews_array.append(dictionary)
        dictionary = {}

#Append the last review if the file did not end with a blank line
if dictionary:
    reviews_array.append(dictionary)

#Write the review/text value of every review to the output file
with open("output.txt", "a") as f1:
    for r in reviews_array:
        if 'review/text' in r:
            print(r['review/text'], file=f1)

Now all the words from the lines starting with review/text are dumped into a file. Next, I need to create a list containing all the unique words.
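
A minimal sketch of that next step follows. It assumes a word is any run of letters or apostrophes and that case should be ignored; both are choices made here, not something the question specifies. The `texts` list is sample data standing in for the lines written to output.txt, and `unique_words` is a hypothetical helper name.

```python
import re

# Sample review texts taken from the question; in practice these would be
# the lines read back from output.txt
texts = [
    "I have bought several of the Vitality canned dog food products",
    "Product arrived labeled as Jumbo Salted Peanuts...",
]

def unique_words(lines):
    """Return a sorted list of the distinct lowercase words in lines."""
    words = set()
    for line in lines:
        # Assumption: a word is a run of letters and apostrophes
        words.update(re.findall(r"[a-z']+", line.lower()))
    return sorted(words)

word_list = unique_words(texts)
print(word_list)
```

A set removes duplicates as the words are collected, so the whole file never has to be held as one long word list.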

CSV stands for comma-separated values. I don't see any commas in your file.

It looks like a broken dictionary (each entry is missing its separating comma):

my_dict ={
 'productid': 12312312,
 'some_key': 'I am the key!',
}
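
The point about commas can be checked with the standard `csv` module, which splits on commas by default just as `pd.read_csv` does. This is only an illustration on two sample lines, assuming foods.txt holds one field per line as the answers above treat it; it is not the parser pandas uses internally.

```python
import csv

# Two sample lines from the question's data (one field per line is assumed)
lines = [
    "review/score: 5.0",
    "review/summary: Good Quality Dog Food",
]

# csv.reader splits each line on commas; there are none here, so every
# line comes back as a single field
rows = list(csv.reader(lines))
print(rows)

# Splitting on the first ": " instead recovers the key and the value
pairs = [line.split(": ", 1) for line in lines]
print(pairs)
```

With only one field per row, the other six names passed to `read_csv` would have no data to fill them, which is consistent with the NaN-related ValueError in the question.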
