How can I save words tokenized from articles, together with their sentence ID numbers, to a CSV file?

Posted 2024-04-18 09:02:09

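I am trying to extract all the words from articles stored in a CSV file, and to write each sentence ID number together with the words that sentence contains to a new CSV file.

This is what I have tried: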

import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
df = pd.read_csv(r"D:\data.csv", nrows=10)

row = 0; sentNo = 0
while row < 1:
    # sent_tokenize (imported above) splits the article text into sentences
    sentences = sent_tokenize(df['articles'][row])
    for sents in sentences:
        sentNo += 1
        # word_tokenize splits one sentence into words and punctuation tokens
        words = word_tokenize(sents)
        print(f'{sentNo}: {words}')
    row += 1

df['articles'][0] contains:

The ultimate productivity hack is saying no. Not doing something will always be faster than doing it. This statement reminds me of the old computer programming saying, “Remember that there is no code faster than no code.”

I took only df['articles'][0], and it gives the following output:

1:['The', 'ultimate', 'productivity', 'hack', 'is', 'saying', 'no', '.']
2:['Not', 'doing', 'something', 'will', 'always', 'be', 'faster', 'than', 'doing', 'it', '.']
3:['This', 'statement', 'reminds', 'me', 'of', 'the', 'old', 'computer', 'programming', 'saying', ',', '“', 'Remember', 'that', 'there', 'is', 'no', 'code', 'faster', 'than', 'no', 'code', '.', '”']

How can I write a new output.csv file, in the format below, that contains every sentence of every article in data.csv?

Sentence No | Word
1             The
              ultimate
              productivity
              hack
              is
              saying
              no
              .
2             Not
              doing 
              something 
              will
              always
              be
              faster
              than
              doing
              it
              .
3             This 
              statement 
              reminds 
              me 
              of 
              the 
              old 
              computer 
              programming 
              saying
              , 
              “
              Remember
              that 
              there
              is
              no
              code
              faster
              than
              no
              code
              .
              ”

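I am new to Python and am using it in a Jupyter notebook.

This is my first post on Stack Overflow. If anything is wrong, please correct me so I can learn. Thank you very much.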


1 answer
#1 · Posted 2024-04-18 09:02:09

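You just need to iterate over the words and write a new line for each word.

The result will be a little unpredictable, because some of your "words" are commas. You may want to use a different delimiter or remove the commas from your word list.

Edit: this seems to be a cleaner approach.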
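Both options mentioned above (dropping the comma tokens, or switching delimiter) can be sketched with the standard library alone. This is only an illustration, not the answerer's code: the token list is a hard-coded, hypothetical sample standing in for the output of word_tokenize, so the snippet runs without NLTK, and the names `without_commas` and `tab_separated` are made up for this example.

```python
import csv
import io

# Hypothetical sample: one already-tokenized sentence that contains a comma token.
tokens = ['old', 'saying', ',', 'no', 'code', '.']

# Option 1: drop the comma tokens before writing anything.
without_commas = [w for w in tokens if w != ',']

# Option 2: keep every token but use a tab as the delimiter,
# so a bare comma is just ordinary cell content.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
for i, word in enumerate(tokens):
    # Only the first word of the sentence carries the sentence number.
    writer.writerow([1 if i == 0 else '', word])

tab_separated = buffer.getvalue()
print(without_commas)
print(tab_separated)
```

Option 1 loses information (the punctuation is gone from the output), while option 2 keeps every token but produces a tab-separated file, so whatever reads it back has to be told about the delimiter.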

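import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize

df = pd.read_csv(r"D:\data.csv", nrows=10)
stcNum = 1

# newline='' keeps Python from translating the explicit '\r\n' a second time on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
  for article in df['articles']:
    # split each article into sentences, then each sentence into words
    for stc in sent_tokenize(article):
      # iterating over stc itself would yield single characters, not words
      for i, word in enumerate(word_tokenize(stc)):
        prntLine = ','
        # only the first word of a sentence carries the sentence number;
        # test the position, because the first word may occur again later
        if i == 0:
          prntLine = str(stcNum) + prntLine
        prntLine = prntLine + word + '\r\n'
        f.write(prntLine)
      stcNum += 1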

Contents of output.csv for the first article:

1,The
,ultimate
,productivity
,hack
,is
,saying
,no
,.
2,Not
,doing
,something
,will
,always
,be
,faster
,than
,doing
,it
,.
3,This
,statement
,reminds
,me
,of
,the
,old
,computer
,programming
,saying
,,     # <<< Most CSV parsers will see this as 3 empty columns
,“
,Remember
,that
,there
,is
,no
,code
,faster
,than
,no
,code
,.
,”
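
The `,,` line flagged in the output above can be avoided by writing the rows with Python's built-in csv module, which quotes any field that contains the delimiter. The following is a minimal sketch under stated assumptions rather than a drop-in replacement: `write_word_rows` is a hypothetical helper, and `sample` is hard-coded, hypothetical data standing in for the tokenized sentences that NLTK would produce.

```python
import csv
import io

def write_word_rows(sentences, out):
    """Write one row per word; the sentence number appears only on a sentence's first word."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['Sentence No', 'Word'])
    for stc_num, words in enumerate(sentences, start=1):
        for i, word in enumerate(words):
            writer.writerow([stc_num if i == 0 else '', word])

# Hypothetical tokenized sentences (in the real script these come from word_tokenize).
sample = [['Not', 'doing', 'it', '.'],
          ['old', 'saying', ',', 'no', 'code', '.']]

buffer = io.StringIO()
write_word_rows(sample, buffer)
csv_text = buffer.getvalue()

# Read the text back to check that every row still has exactly two columns.
rows = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```

To write to a real file instead of the in-memory buffer, pass a file object opened with `open('output.csv', 'w', newline='', encoding='utf-8')`; the csv documentation recommends `newline=''` so that line endings are not translated twice.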
