Extract all text before the first tab from a list of strings

Posted 2024-05-16 08:09:24


I have text data from http://www.manythings.org/anki/ that looks like this:

['Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n',
 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n',
 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n',
 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n',
 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n',
 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n',
 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n',
...
]

This is what I tried:

English = []
for sent in data_examples:
    pattern  = re.compile(r'.+?\t')
    matches = pattern.finditer(sent)
    for match in matches:
        English.append(match)

How can I capture only the English part of each line? My attempt doesn't quite work.


2 Answers

The English part is the first tab-separated column of each line, so all you need is:

English = [sent.split('\t')[0] for sent in data_examples]
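
For context on why the attempt in the question misbehaves: `finditer` with `.+?\t` yields one match for every tab-terminated segment, so both the English and the German field are collected, and the appended items are match objects rather than strings. A minimal sketch of two alternatives follows, using two sample lines copied from the question; the names `english_partition`, `english_regex` and `first_field` are illustrative, not from the original post.

```python
import re

# Two sample lines copied from the question's data.
data_examples = [
    'Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n',
    'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n',
]

# str.partition splits once, at the first tab, and returns (before, sep, after).
english_partition = [sent.partition('\t')[0] for sent in data_examples]

# Regex variant: match from the start of the string up to (not including)
# the first tab, then take the matched text with .group().
first_field = re.compile(r'[^\t]*')
english_regex = [first_field.match(sent).group() for sent in data_examples]

print(english_partition)  # ['Hi.', 'Run!']
print(english_regex)      # ['Hi.', 'Run!']
```

`sent.split('\t', 1)[0]` gives the same result as `partition` here; both avoid splitting the rest of the line when only the first field is needed.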

Solution:
This may serve your purpose:

import nltk

# Load the NLTK word list, downloading the corpus on first use
try:
    words = set(nltk.corpus.words.words())
except LookupError:
    nltk.download('words')
    words = set(nltk.corpus.words.words())

# Extra words which are not present in nltk words corpus
words_need_to_include = ['france']

for w in words_need_to_include:
    words.add(w)

# Words which we don't want in nltk words corpus
words_need_to_exclude = ['by']

for w in words_need_to_exclude:
    # discard() does not raise if the word is absent from the corpus
    words.discard(w)

# Input data
in_text = ['Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n france',
 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n',
 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n',
 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n',
 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n',
 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n',
 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n',
]

# Code
English = []
for x in in_text:
    English.append(" ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in words))
    #print(" ".join(nltk.wordpunct_tokenize(x)))

print(English)

Output:

['Hi France Attribution france', 'Hi France Attribution', 'Run France Attribution', 'Wow France Attribution', 'Wow France Attribution', 'Fire France Attribution', 'Help France Attribution']
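
Note that the output above still contains "France Attribution", because the dictionary filter runs over the whole line, license text included. If only the English sentence is wanted, one option is to restrict the filter to the first tab-separated field. The sketch below assumes a tiny hand-written `vocabulary` set and a simple `\w+` tokenizer as stand-ins for the NLTK word list and `nltk.wordpunct_tokenize`, so that it runs without any download; `english_words` is a hypothetical helper name.

```python
import re

# Hypothetical stand-in for the NLTK word set used in the answer above.
vocabulary = {'hi', 'run', 'wow', 'fire', 'help', 'france', 'attribution'}

# Two sample lines copied from the question's data.
in_text = [
    'Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n',
    'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n',
]

def english_words(line):
    """Keep only known words from the first tab-separated field of a line."""
    first_field = line.split('\t', 1)[0]
    # Simple stand-in tokenizer: runs of word characters, punctuation dropped.
    tokens = re.findall(r'\w+', first_field)
    return ' '.join(t for t in tokens if t.lower() in vocabulary)

english = [english_words(line) for line in in_text]
print(english)  # ['Hi', 'Fire']
```

Unlike a plain split on the tab, this drops punctuation and any word missing from the vocabulary, so it only makes sense when that filtering is actually desired.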
