Pandas replace with dictionary does not work on a CSV file

import pandas as pd


def write_out_abbreviations():
    """Replace abbreviations in metadata file with full words."""
    # Read file into dataframe.
    with open('/home/username/data/metadata.csv') as f:
        df = pd.read_csv(f, names=['Audio_Filename', 'Segment_Text'], sep='|')

    # Create dictionary that contains abbreviations and their full words.
    replacers = {
        'bspw.': 'beispielsweise',
        'bzw.': 'beziehungsweise',
        ' ca.': ' zirka',
        'd.h.': 'das heißt',
        'Dr.': 'Doktor',
        ' ggf.': ' gegebenenfalls',
        'i.d.R.': 'in der Regel',
        ' inkl.': ' inklusive',
        'insb.': 'insbesondere',
        'Tel.': 'Telefon',
        'z.B.': 'zum Beispiel'}

    # Replace abbreviations in 'Segment_Text' column.
    # APPROACH 1:
    # df2 = df.replace({'Segment_Text': {replacers}})
    # APPROACH 2:
    # df2 = df['Segment_Text'].replace(replacers)
    # APPROACH 3:
    # df2 = df.Segment_Text.str.split()
    # df2 = df.Segment_Text.apply(lambda x: ' '.join([replacers.get(e, e) for e in x]))
    # APPROACH 4:
    # df['Segment_Text'] = df['Segment_Text'].map(replacers).fillna(df['Segment_Text'])

    # Write this dataframe to new file.
    df2.to_csv('/home/username/data/metadata_REPLACED.csv',  # or df.to_csv...
               header=False, index=False, sep='|')


write_out_abbreviations()

2 answers

User

Answer 1 · edited 2024-05-16 02:30:42

You are looking for the Series.str.replace function, used with a regular expression and a replacement function:

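import re
# Escape the keys so each '.' matches literally; recent pandas needs regex=True for a compiled pattern.
rx = re.compile('|'.join(re.escape(k) for k in replacers))
df2 = df['Segment_Text'].str.replace(rx, lambda m: replacers[m.group(0)], regex=True)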

This gives df2:

0    Was ist der Unterschied zwischen Gefahr und Ri...
1    Die Gefahr wird zum Beispiel in ein Risiko umg...
2    Ein Sturz das heißt ein Fall von der Kante ist...
Name: Segment_Text, dtype: object
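
As a minimal, self-contained sketch of the same idea, the following applies the compiled pattern with re.sub through Series.map, which sidesteps differences in how pandas versions treat the regex argument. The sample sentences and the shortened dictionary are assumptions made for illustration, not the asker's real data, and the longest-first sort is an extra precaution that the answer above does not use.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the 'Segment_Text' column.
df = pd.DataFrame({
    'Segment_Text': [
        'Die Gefahr wird z.B. in ein Risiko umgewandelt.',
        'Ein Sturz d.h. ein Fall von der Kante ist ein Risiko.',
        'Das dauert ca. zwei Stunden.',
    ]
})

# Shortened version of the dictionary from the question.
replacers = {
    'z.B.': 'zum Beispiel',
    'd.h.': 'das heißt',
    ' ca.': ' zirka',
}

# Longest keys first, so a longer abbreviation wins over a shorter one
# that shares its prefix; re.escape makes every '.' match literally.
keys = sorted(replacers, key=len, reverse=True)
rx = re.compile('|'.join(re.escape(k) for k in keys))


def expand(text):
    """Replace every abbreviation found in text with its full form."""
    return rx.sub(lambda m: replacers[m.group(0)], text)


df['Segment_Text'] = df['Segment_Text'].map(expand)
result = df['Segment_Text'].tolist()
```

Because the keys are escaped, each match is exactly one of the dictionary keys, so the lookup inside the lambda cannot raise a KeyError. Rows containing missing values would need to be handled separately before calling expand.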

User

Answer 2 · edited 2024-05-16 02:30:42

You can try this:

Sample input:

import pandas as pd
df = pd.DataFrame({'c1109db0.wav': {0: 'c112c091.wav', 1: 'c11335c1.wav'},
 'Was_ist_der_Unterschied_zwischen_Gefahr_und_Risiko?': {0: 'Die Gefahr wird z.B. in ein Risiko umgewandelt.',
  1: 'Ein Sturz d.h. ein Fall von der Kante ist ein Risiko.'}})

Code:

replacers = {
    'bspw.': 'beispielsweise',
    'bzw.': 'beziehungsweise',
    'ca.': 'zirka',
    'd.h.': 'das heißt',
    'Dr.': 'Doktor',
    'ggf.': 'gegebenenfalls',
    'i.d.R.': 'in der Regel',
    'inkl.': 'inklusive',
    'insb.': 'insbesondere',
    'Tel.': 'Telefon',
    'z.B.': 'zum Beispiel'}

df.iloc[:,1] = df.iloc[:,1].str.split().map(lambda lst: ' '.join([replacers.get(word, word) for word in lst]))

# Out[158]:
# 0    Die Gefahr wird zum Beispiel in ein Risiko umg...
# 1    Ein Sturz das heißt ein Fall von der Kante ist...
# Name: Was_ist_der_Unterschied_zwischen_Gefahr_und_Risiko?, dtype: object

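By the way, I would not include spaces in the abbreviations. Instead, split the whole sentence into words, then look up each word in the dictionary, falling back to the word itself when there is no match.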
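
The word-by-word lookup described above can be sketched as follows. The helper name expand_words and the two sample sentences are assumptions for illustration. The second sentence is there to show a limitation of this approach: a token only matches when it is exactly equal to a dictionary key, so an abbreviation with attached punctuation such as '(d.h.' is left unchanged.

```python
import pandas as pd

# Shortened dictionary; keys carry no surrounding spaces.
replacers = {'z.B.': 'zum Beispiel', 'd.h.': 'das heißt'}


def expand_words(text):
    """Split on whitespace and swap each token that is an exact dictionary key."""
    return ' '.join(replacers.get(word, word) for word in text.split())


# Hypothetical sample sentences.
s = pd.Series([
    'Die Gefahr wird z.B. in ein Risiko umgewandelt.',
    'Ein Sturz (d.h. ein Fall) ist ein Risiko.',
])
out = s.map(expand_words).tolist()
```

Note that splitting and re-joining also collapses any run of whitespace into a single space, which may or may not matter for the output file.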
