Extract all text before the first tab from each string in a list

['Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n', 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n', 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n', 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n', 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n', 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n', 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n', ... ]

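2 Answers

User

#1 · edited 2024-05-16 08:09:24

The English part of each string is in the first column.

All you need to do is split on the tab and take the first field: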

English = [sent.split('\t')[0] for sent in data_examples]
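As a minimal sketch of the same idea, `str.partition` stops at the first tab and returns the whole string when no tab is present. The helper name `text_before_first_tab` and the extra tab-less sample line are assumptions added for illustration; they are not part of the original data.

```python
def text_before_first_tab(line: str) -> str:
    """Return the text before the first tab (hypothetical helper).

    If the line contains no tab, the whole line is returned
    with its trailing newline removed.
    """
    head, _sep, _rest = line.rstrip("\n").partition("\t")
    return head


data_examples = [
    "Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n",
    "Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n",
    "no tab here\n",  # made-up line to show the no-tab case
]

english = [text_before_first_tab(s) for s in data_examples]
print(english)  # ['Hi.', 'Run!', 'no tab here']
```

Unlike `split('\t')`, `partition` does not split the rest of the line, which may matter slightly on very long inputs.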

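User

#2 · edited 2024-05-16 08:09:24

Solution:
This may serve your purpose. Rather than splitting on the tab, it keeps only the tokens that appear in the NLTK words corpus.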

import nltk

# Load the NLTK words corpus, downloading it on first use
try:
    words = set(nltk.corpus.words.words())
except LookupError:
    nltk.download('words')
    words = set(nltk.corpus.words.words())

# Extra words which are not present in nltk words corpus
words_need_to_include = ['france']

for w in words_need_to_include:
    words.add(w)

# Words which we don't want in nltk words corpus
words_need_to_exclude = ['by']

for w in words_need_to_exclude:
    words.discard(w)  # discard() does not raise if the word is absent

# Input data
in_text = ['Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n france',
 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n',
 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n',
 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n',
 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n',
 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n',
 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n',
]

# Code
English = []
for x in in_text:
    English.append(" ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words))
    #print(" ".join(nltk.wordpunct_tokenize(x)))

print(English)

Output:

['Hi France Attribution france', 'Hi France Attribution', 'Run France Attribution', 'Wow France Attribution', 'Wow France Attribution', 'Fire France Attribution', 'Help France Attribution']
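To see why words such as "France" and "Attribution" survive in this output, here is a rough stand-in for the filter that needs no NLTK download. It is only a sketch: `toy_vocab` is a tiny made-up vocabulary, not the NLTK corpus, and the regular expression is assumed to approximate what `nltk.wordpunct_tokenize` does.

```python
import re

# Assumed approximation of wordpunct tokenization:
# runs of word characters, or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

# Hypothetical miniature vocabulary standing in for the NLTK words corpus
toy_vocab = {"hi", "run", "france", "attribution"}


def keep_vocab_words(line, vocab):
    """Keep only tokens whose lowercase form is in vocab."""
    tokens = TOKEN_RE.findall(line)
    return " ".join(t for t in tokens if t.lower() in vocab)


line = "Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n"

filtered = keep_vocab_words(line, toy_vocab)
first_column = line.split("\t", 1)[0]

print(filtered)      # Hi France Attribution
print(first_column)  # Hi.
```

The filter looks at every column, so dictionary words from the license text pass through and punctuation is dropped, whereas splitting on the first tab returns the first column unchanged.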
