Reading lines in Python, processing them into a list, and writing to a file

2 votes
2 answers
1396 views
Asked 2025-04-17 20:53

I'm just starting to learn Python, and I'm currently working with the following tweets:

@PrincessSuperC Hey Cici sweetheart! Just wanted to let u know I luv u! OH! and will the mixtape drop soon? FANTASY RIDE MAY 5TH!!!!  
@Msdebramaye I heard about that contest! Congrats girl!! 
UNC!!! NCAA Champs!! Franklin St.: I WAS THERE!! WILD AND CRAZY!!!!!! Nothing like it...EVER http://tinyurl.com/49955t3
Do you Share More #jokes #quotes #music #photos or #news #articles on #Facebook or #Twitter?
Good night #Twitter and #TheLegionoftheFallen.  5:45am cimes awfully early!
I just finished a 2.66 mi run with a pace of 11'14"/mi with Nike+ GPS. #nikeplus #makeitcount
Disappointing day. Attended a car boot sale to raise some funds for the sanctuary, made a total of 88p after the entry fee - sigh
no more taking Irish car bombs with strange Australian women who can drink like rockstars...my head hurts.
Just had some bloodwork done. My arm hurts

I would like to get a feature vector as output, which should look like this:

featureList = ['hey', 'cici', 'luv', 'mixtape', 'drop', 'soon', 'fantasy', 'ride', 'heard', 
'congrats', 'ncaa', 'franklin', 'wild', 'share', 'jokes', 'quotes', 'music', 'photos', 'news',
'articles', 'facebook', 'twitter', 'night', 'twitter', 'thelegionofthefallen', 'cimes', 'awfully',
'finished', 'mi', 'run', 'pace', 'gps', 'nikeplus', 'makeitcount', 'disappointing', 'day', 'attended',
'car', 'boot', 'sale', 'raise', 'funds', 'sanctuary', 'total', 'entry', 'fee', 'sigh', 'taking',
'irish', 'car', 'bombs', 'strange', 'australian', 'women', 'drink', 'head', 'hurts', 'bloodwork', 
'arm', 'hurts']

However, the only output I get right now is:

hey, cici, luv, mixtape, drop, soon, fantasy, ride

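This comes from the first tweet only. The code keeps looping over that one tweet and never moves on to the next. I tried using nextLine, but that doesn't seem to work in Python. My code is below: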

#import regex
import re
import csv
import pprint
import nltk.classify

#start replaceTwoOrMore
def replaceTwoOrMore(s):
    #look for 2 or more repetitions of character
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL) 
    return pattern.sub(r"\1\1", s)
#end

#start process_tweet
def processTweet(tweet):
    # process the tweets

    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+','AT_USER',tweet)    
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end 

#start getStopWordList
def getStopWordList(stopWordListFileName):
    #read the stopwords
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('URL')

    fp = open(stopWordListFileName, 'r')
    line = fp.readline()
    while line:
        word = line.strip()
        stopWords.append(word)
        line = fp.readline()
    fp.close()
    return stopWords
#end

#start getfeatureVector
def getFeatureVector(tweet):
    featureVector = []
    #split tweet into words
    words = tweet.split()
    for w in words:
        #replace two or more with two occurrences
        w = replaceTwoOrMore(w)
        #strip punctuation
        w = w.strip('\'"?,.')
        #check if the word starts with a letter
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        #ignore if it is a stop word
        if(w in stopWords or val is None):
            continue
        else:
            featureVector.append(w.lower())
    return featureVector
#end

#Read the tweets one by one and process it
fp = open('data/sampleTweets.txt', 'r')
line = fp.readline()

st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')

while line:
    processedTweet = processTweet(line)
    featureVector = getFeatureVector(processedTweet)
    with open('data/niek_corpus_feature_vector.txt', 'w') as f:
        f.write(', '.join(featureVector))
#end loop
fp.close()

Update:
Following the suggestion below, I tried changing the loop:

st = open('data/feature_list/stopwords.txt', 'r')
stopWords = getStopWordList('data/feature_list/stopwords.txt')

with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        with open('data/niek_corpus_feature_vector.txt', 'w') as f:
            f.write(', '.join(featureVector))
fp.close()

As a result, the output now contains only the words from the last tweet:

bloodwork, arm, hurts

I'm still trying to figure this out.

2 Answers

1
line = fp.readline()

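This call reads only one line from the file. Your loop then processes that line, but because it never reads another one, `line` never changes and the loop keeps working on the first tweet. What you need is to read every line in the file. Once the whole file has been read, process each line the same way you did before.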

lines = fp.readlines()

# Now process each line
for line in lines:
    # Process the line as in your original code.
    # No inner "while line:" is needed; it would never terminate.
    processedTweet = processTweet(line)
    featureVector = getFeatureVector(processedTweet)
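
Putting this together, one way to end up with a single featureList is to accumulate the words from every tweet and write the output file once, after the loop. The following is only a minimal sketch under stated assumptions: `extract_words` is a hypothetical, simplified stand-in for the question's processTweet/getFeatureVector pair, the stop-word set is made up for the demo, and a temporary directory replaces the real data/ paths.

```python
import os
import re
import tempfile

# Hypothetical stop words; the question loads these from stopwords.txt
STOP_WORDS = {"i", "my", "just", "had", "some", "about", "that"}

def extract_words(tweet):
    """Simplified stand-in for processTweet + getFeatureVector."""
    words = []
    for w in tweet.lower().split():
        w = w.strip("'\"?,.!")
        # keep only words that start with a letter and are not stop words
        if re.search(r"^[a-z][a-z0-9]*$", w) and w not in STOP_WORDS:
            words.append(w)
    return words

def build_feature_list(in_path, out_path):
    feature_list = []
    with open(in_path, "r") as fp:
        for line in fp:  # one tweet per line
            feature_list.extend(extract_words(line))
    # open the output once, after every tweet has been processed
    with open(out_path, "w") as f:
        f.write(", ".join(feature_list))
    return feature_list

# Small demo using two of the tweets from the question
tmp_dir = tempfile.mkdtemp()
tweets_path = os.path.join(tmp_dir, "sampleTweets.txt")
out_path = os.path.join(tmp_dir, "feature_vector.txt")
with open(tweets_path, "w") as f:
    f.write("@Msdebramaye I heard about that contest! Congrats girl!!\n")
    f.write("Just had some bloodwork done. My arm hurts\n")

features = build_feature_list(tweets_path, out_path)
print(features)
```

Because the output file is opened only once, the choice between 'w' and 'a' no longer matters, and rerunning the script gives the same file every time.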

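The Python file readlines() method

readlines() reads until it reaches the end of the file (EOF) and returns a list containing all the lines. If you pass the optional sizehint argument, it does not read all the way to EOF; instead it reads whole lines totalling approximately sizehint bytes (possibly rounded up to an internal buffer size).

The syntax of readlines() is:

fileObject.readlines(sizehint)

Parameters: sizehint -- the approximate number of bytes to read from the file.

Return value: this method returns a list containing the lines.

Example: the following example shows how readlines() is used.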

#!/usr/bin/python3

# Open a file for reading
fo = open("foo.txt", "r")
print("Name of the file: ", fo.name)

# Assuming the file has the following 5 lines
# This is 1st line
# This is 2nd line
# This is 3rd line
# This is 4th line
# This is 5th line

# Read every line into a list
line = fo.readlines()
print("Read Line: %s" % (line))

# The file position is already at EOF, so this returns an empty list
line = fo.readlines(2)
print("Read Line: %s" % (line))

# Close the opened file
fo.close()

Running the program above produces the following result:

Name of the file:  foo.txt
Read Line: ['This is 1st line\n', 'This is 2nd line\n', 'This is 3rd line\n', 'This is 4th line\n', 'This is 5th line\n']
Read Line: []
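
For comparison, the three usual ways of reading lines can be sketched side by side. This is an illustrative example only: it writes its own small foo.txt into a temporary directory (an assumption made so the snippet is self-contained) and shows that a readline() loop works as long as it reads again on every pass.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w") as f:
    for n in ("1st", "2nd", "3rd"):
        f.write("This is %s line\n" % n)

# 1) readlines(): the whole file as a list, in one call
with open(path, "r") as f:
    all_lines = f.readlines()

# 2) readline(): one line per call; the loop must read again each time
one_by_one = []
with open(path, "r") as f:
    line = f.readline()
    while line:
        one_by_one.append(line)
        line = f.readline()  # without this, the loop never advances

# 3) iterating over the file object directly
iterated = []
with open(path, "r") as f:
    for line in f:
        iterated.append(line)

print(all_lines == one_by_one == iterated)
```

All three produce the same list of lines. Iterating over the file object is usually the simplest choice and avoids loading a large file into memory at once.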
1

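If you would rather read one line at a time than call readlines(), you can loop over the file object, as shown below. Note that the output file is opened in append mode, so each tweet's words are added to the file instead of overwriting what is already there.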

stopWords = getStopWordList('data/feature_list/stopwords.txt')
with open('data/sampleTweets.txt', 'r') as fp:
    for line in fp:
        processedTweet = processTweet(line)
        featureVector = getFeatureVector(processedTweet)
        # 'a' appends in text mode; 'ab' would require bytes in Python 3
        with open('data/niek_corpus_feature_vector.txt', 'a') as f:
            # newline keeps words from adjacent tweets from running together
            f.write(', '.join(featureVector) + '\n')
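
The reason the updated loop in the question kept only the last tweet is the file mode: 'w' truncates the file every time it is opened, while 'a' adds to the end. A small sketch of the difference follows. The helper `write_each` and the sample chunks are hypothetical and exist only for this demonstration, and a temporary directory stands in for the real output path.

```python
import os
import tempfile

def write_each(path, mode, chunks):
    """Reopen `path` with `mode` for every chunk, as the loop in the question does."""
    for chunk in chunks:
        with open(path, mode) as f:
            f.write(chunk + "\n")
    with open(path, "r") as f:
        return f.read().splitlines()

tmp_dir = tempfile.mkdtemp()
chunks = ["hey, cici, luv", "heard, congrats", "bloodwork, arm, hurts"]

overwritten = write_each(os.path.join(tmp_dir, "w.txt"), "w", chunks)
appended = write_each(os.path.join(tmp_dir, "a.txt"), "a", chunks)

print(overwritten)  # only the last chunk survives
print(appended)     # all three chunks, in order
```

Keep in mind that with 'a' the file also keeps growing across runs of the script, so you may want to delete or truncate it before starting.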
