Splitting an input file gives the wrong result

Posted 2024-04-20 06:27:00


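I'm trying to write a program that takes two .txt files named by the user. It should read the keywords file and split each line into a word and a value, then read the tweets file and split each line into a location and a tweet/time.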

Example of the keywords file (a single-spaced .txt file):

love,10
like,5
best,10
hate,1
haha,10
better,10

Example of the tweets file (note that only four lines are shown here; the actual .txt file has several hundred):

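[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC

[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.

[38.809954939999997, -77.125144050000003] 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases

[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life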

So far my program looks like this:

#prompt the user for the file name of keywords file
keywordsinputfile = input("Please input file name: ")
tweetsinputfile = input ("Please input tweets file name: ")

#try to open given input file
try:
    k=open(keywordsinputfile, "r")
except IOError:
    print ("{} file not found".format(keywordsinputfile))
try:
    t=open(tweetsinputfile, "r")
except IOError:
    print ("{} file not found".format(tweetsinputfile))
    exit()

def main ():   #main function
    kinputfile = open(keywordsinputfile, "r")         #Opens File for keywords
    tinputfile = open(tweetsinputfile, "r")           #Opens file for tweets
    HappyWords = {}
    HappyValues = {}
    for line in kinputfile:                           #splits keywords
        entries = line.split(",")
        hvwords = str(entries[0])
        hvalues = int(entries[1])
        HappyWords["keywords"] = hvwords           #stores Happiness keywords
        HappyValues["values"] = hvalues            #stores Happiness Values
    for line in tinputfile:
        twoparts = line.split("]")  #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
        startlocation = (twoparts[0])   #takes the first part (the locations)
    def testing(startlocation):
        for line in startlocation:     
            intlocation = line.split("[")      #then gets rid of the "[" at the beginning of the locations
            print (intlocation)
    testing(startlocation)

main()

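What I want to get out of this is the following (for an arbitrary number of lines; the actual file contains far more than the four lines shown above):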

41.298669629999999, -81.915329330000006
33.702900329999999, -117.95095704000001
38.809954939999997, -77.125144050000003
27.994195699999999, -82.569434900000005

What I get is:

['', '']
['2']
['7']
['.']
['9']
['9']
['4']
['1']
['9']
['5']
['6']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
[',']
[' ']
['-']
['8']
['2']
['.']
['5']
['6']
['9']
['4']
['3']
['4']
['9']
['0']
['0']
['0']
['0']
['0']
['0']
['0']
['5']

In other words, it processes only the last line of the txt file and splits that line into individual characters.

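After that, I need to store the locations in a way that lets me split them again, with the first part going into one list and the second part into another. For example: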

for line in locations:
    entries = line.split(",")
    latitude = float(entries[0])    # coordinates are decimals, so use float rather than int
    longitude = float(entries[1])

Thanks in advance!


2 Answers

You just need to insert a few tracing print statements to show what is happening. Here is how I did it:

for line in tinputfile:
    twoparts = line.split("]")  #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
    startlocation = (twoparts[0])   #takes the first part (the locations)
    print ("     -")
    print ("twoparts", twoparts) 
    print ("startlocation", startlocation)
def testing(startlocation):
    for line in startlocation:     
        print ("line", line)
        intlocation = line.split("[")      #then gets rid of the "[" at the beginning of the locations
        print ("intlocation", intlocation)
testing(startlocation)

... which produces a trace that begins:

     -
twoparts ['[41.298669629999999, -81.915329330000006', " 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC\n"]
startlocation [41.298669629999999, -81.915329330000006
     -
twoparts ['[33.702900329999999, -117.95095704000001', " 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.\n"]
startlocation [33.702900329999999, -117.95095704000001
     -
twoparts ['[38.809954939999997, -77.125144050000003', ' 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases\n']
startlocation [38.809954939999997, -77.125144050000003
     -
twoparts ['[27.994195699999999, -82.569434900000005', ' 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n']
startlocation [27.994195699999999, -82.569434900000005
line [
intlocation ['', '']
line 2
intlocation ['2']
line 7

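Analysis:

There are two basic problems:

  1. The processing statement testing(startlocation) sits outside the loop, so it only uses the last input line.
  2. As you can see in the output for "twoparts", the coordinates you want are still a string, not a list of floats. You need to strip off the bracket, split the two values apart, and then convert them to float. As written, when you iterate over startlocation you are iterating over the characters of the string, not over two floats.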

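Also: why define the function inside main()? Nested there, it is redefined every time main() runs. Move it to module level, before the main program; that is where well-behaved functions live. :-)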
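Both fixes together might look like the sketch below. It assumes every line has the form `[lat, lon] rest`; the helper name `parse_location` and the inline sample lines are illustrative and not part of the original program.

```python
def parse_location(line):
    """Return (latitude, longitude) as floats from one tweet line."""
    # Everything before the first "]" is the location, e.g. "[41.29, -81.91"
    location_part = line.split("]", 1)[0]
    # Drop the leading "[" and split the two numbers at the comma
    lat_text, lon_text = location_part.lstrip("[").split(",")
    return float(lat_text), float(lon_text)


# Sample lines stand in for the tweets file (assumption for this sketch)
sample_lines = [
    "[38.809954939999997, -77.125144050000003] 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases\n",
    "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n",
]

locations = []
for line in sample_lines:
    # The parsing happens inside the loop, so every line is handled
    locations.append(parse_location(line))

for latitude, longitude in locations:
    print(latitude, longitude)
```

With a real file, the loop would run over the open file object instead of `sample_lines`.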


More detail on point 2:

Let's step through your code using the last line of the sample input. Start at the top of the loop over the lines in tinputfile:

twoparts = line.split("]")

twoparts is now a pair of elements, both strings:

['[27.994195699999999, -82.569434900000005',
 ' 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n']

Then startlocation is set to the first element:

'[27.994195699999999, -82.569434900000005'

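Next comes the definition of the function testing, which changes nothing by itself. The statement after it calls testing, and we enter the routine.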

testing(startlocation)
for line in startlocation:

The important point here is that startlocation is a string:

'[27.994195699999999, -82.569434900000005'

... so when you execute this loop, you iterate over the string one character at a time.
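A small demonstration of that behaviour, using shortened coordinates purely for illustration:

```python
startlocation = "[27.99, -82.56"  # shortened coordinates, for illustration only

# Iterating over a string yields one character at a time
characters = [ch for ch in startlocation]
print(characters[:4])    # ['[', '2', '7', '.']

# Splitting a single character on "[" gives ['', ''] for "[" itself
# and a one-element list for any other character
print("[".split("["))    # ['', '']
print("2".split("["))    # ['2']
```

This is exactly the pattern in the question's output: `['', '']` first, then one single-character list per remaining character.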

Fix:

Honestly, I don't know what testing is supposed to do. It looks like all you need is to strip off that bracket:

intlocation = startlocation.split('[')[1]

... or simply

intlocation = startlocation[1:]

If instead you want the float values as a two-element list, strip the bracket as above, split the elements at the comma, and convert each one to float:

intlocation = [ float(x) for x in startlocation[1:].split(',') ]
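To keep latitudes and longitudes in two separate lists, as the question asks, the same expression can feed two accumulators. This is only a sketch: the list names and the sample strings are assumptions, standing in for the values that `line.split("]")[0]` would produce.

```python
# Location strings as they look after line.split("]")[0] (sample data)
start_locations = [
    "[41.298669629999999, -81.915329330000006",
    "[33.702900329999999, -117.95095704000001",
]

latitudes = []
longitudes = []
for startlocation in start_locations:
    # Strip the "[" and convert both comma-separated parts to float
    latitude, longitude = [float(x) for x in startlocation[1:].split(",")]
    latitudes.append(latitude)
    longitudes.append(longitude)

print(latitudes)
print(longitudes)
```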

It looks like what this really needs is ast.literal_eval.

import ast

for line in tinputfile:
    twoparts = line.split("]")
    startlocation = ast.literal_eval(twoparts[0] + ']') # add the ']' back in
    # startlocation is now a list of two coordinates.
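If some lines in the file might be malformed, the call can be wrapped so that bad lines are skipped rather than crashing the program. The sketch below assumes that returning None for a bad line is acceptable; `safe_location` is a hypothetical helper name.

```python
import ast


def safe_location(line):
    """Return [lat, lon] parsed with ast.literal_eval, or None if the line is malformed."""
    # Split once at the first "]" and add it back so the text is a complete list literal
    location_text = line.split("]", 1)[0] + "]"
    try:
        location = ast.literal_eval(location_text)
    except (ValueError, SyntaxError):
        return None
    # Accept only a list of exactly two numbers
    if isinstance(location, list) and len(location) == 2 and all(
        isinstance(value, (int, float)) for value in location
    ):
        return location
    return None


good = "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 @Miss_mariiix3 is the love of my life\n"
bad = "no coordinates on this line\n"

print(safe_location(good))
print(safe_location(bad))
```

ast.literal_eval only evaluates Python literals, so unlike eval it will not run arbitrary code found in the file.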

But you may be better off using re.

> import re
> example = '[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text'
> fmt = re.split(r'\[(-?[0-9.]+),\s?(-?[0-9.]+).\s*\d\s*(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})',example)
> fmt
['', '27.994195699999999', '-82.569434900000005', '2011-08-28 19:02:36', ' text text text text']
> location = (float(fmt[1]), float(fmt[2]))
> time = fmt[3]
> text = fmt[4]

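What's going on?

Each (...) in the regular expression (the re module) tells re.split to return that piece as its own element of the result.

The first and second groups are -?[0-9.]+. That matches an optional minus sign followed by digits and decimal points (we could be stricter, but you really don't need to be).

The next (...) group matches the date and time: \d{4} means "four digits" and \d{1,2} means "one or two digits".

Alternatively, you can use both approaches together: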
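For comparison, a stricter variant could use a compiled pattern with named groups and re.match instead of re.split. The pattern below is an assumption about the line format (bracketed pair, one digit, timestamp, text), not something taken from the original post.

```python
import re

# Stricter pattern (an assumption about the line format):
# "[lat, lon] <one digit> YYYY-MM-DD HH:MM:SS <text>"
LINE_PATTERN = re.compile(
    r"\[(?P<lat>-?\d+\.\d+),\s*(?P<lon>-?\d+\.\d+)\]\s*"
    r"\d\s+"
    r"(?P<time>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s*"
    r"(?P<text>.*)"
)

example = "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text"
match = LINE_PATTERN.match(example)
if match:
    # Named groups make it clear which piece is which
    location = (float(match.group("lat")), float(match.group("lon")))
    timestamp = match.group("time")
    text = match.group("text")
    print(location, timestamp, text)
else:
    print("line does not match the expected format")
```

A line that does not fit the pattern gives `None` from `match`, which makes malformed input easy to detect.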

> from ast import literal_eval
> fmt = re.split(r'\[(-?[0-9.]+,\s?-?[0-9.]+).\s*\d\s*(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})',example)
> fmt # watch what happens when I change the grouping.
['', '27.994195699999999, -82.569434900000005', '2011-08-28 19:02:36', ' text text text text']
> location = literal_eval('(' + fmt[1] + ')')
> time = fmt[2]
> text = fmt[3]
