How do I replace long strings with different shorter strings across an entire DataFrame?

Posted 2024-06-16 17:40:33

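I want to replace the long strings in a DataFrame with much shorter strings. I have a short dictionary that maps the substrings I am looking for to the values I want to replace them with.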

import pandas as pd
from io import StringIO

replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
3;This is a long substring1 of text that has lots of words
4;This is substring2 and also contains more text than needed
5;This is substring2 and also contains more text than needed
6;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
8;This is a long substring1 of text that has lots of words
9;Within this string is a short substring that is unique
10;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")
print(df)

for s in replacement_dict.keys():
    if df['Long String'].str.contains(s):
        df['Long String'] = replacement_dict[df['Long String'].str.contains(s)]

The expected DataFrame looks like this:

   id  Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3

When I run the code above, I get the following error:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    if df['Long String'].str.contains(s):
  File "h:\Anaconda\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I replace the long strings with different shorter strings across the entire DataFrame?


1 answer

Answer 1 · Posted 2024-06-16 17:40:33

You can do this kind of thing with .replace(). However, you need to modify your dictionary slightly to get the result you want.

replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

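What I did was turn the keys into regular-expression strings. Each key now matches everything before and after the substring you are looking for, not just the substring itself. This is important.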
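To illustrate why the surrounding .* matters, here is a minimal sketch using Python's standard re module rather than pandas. It is not part of the original answer; the variable names are chosen for illustration, and re.escape is added on the assumption that the keys are meant as literal text.

```python
import re

text = "Within this string is a short substring that is unique"
needle = "a short substring"

# Without wildcards, only the matched fragment is swapped out.
partial = re.sub(re.escape(needle), "substring3", text)

# With ".*" on both sides, the pattern consumes the whole string,
# so the replacement takes the place of the entire value.
pattern = ".*" + re.escape(needle) + ".*"
full = re.sub(pattern, "substring3", text)

print(partial)  # Within this string is substring3 that is unique
print(full)     # substring3
```

Note that "." does not match a newline by default, so this sketch assumes single-line values like the ones in the question.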

Next, replace the entire for loop with the following:

df['Long String'] = df['Long String'].replace(replacement_dict, regex=True)

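.replace() accepts a dictionary in which the keys are the strings to match and the values are the replacement text; with regex=True, the keys are treated as regular expressions. The keys were changed to capture everything before and after the substring so that the whole cell value is replaced, not just the small matched fragment.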
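Putting the pieces together, a self-contained sketch might look like the following. It assumes Python 3 (hence io.StringIO) and uses a three-row excerpt of the question's data to keep it short.

```python
from io import StringIO

import pandas as pd

# Three-row excerpt of the sample data from the question.
exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
""")
df = pd.read_csv(exampledata, sep=";")

# Keys are regular expressions that cover the whole cell value.
replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

# regex=True makes pandas interpret the dictionary keys as patterns.
df["Long String"] = df["Long String"].replace(replacement_dict, regex=True)

print(df["Long String"].tolist())  # ['substring1', 'substring2', 'substring3']
```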

For example, the dictionary without the .* parts would produce a DataFrame like this:

   id                                        Long String
0   1  This is a long substring1 of text that has lot...
1   2  This is substring2 and also contains more text...
2   3  This is a long substring1 of text that has lot...
3   4  This is substring2 and also contains more text...
4   5  This is substring2 and also contains more text...
5   6  This is substring2 and also contains more text...
6   7    Within this string is substring3 that is unique
7   8  This is a long substring1 of text that has lot...
8   9    Within this string is substring3 that is unique
9  10    Within this string is substring3 that is unique

Note that the only change you actually see is for the "a short substring" key, because "substring1" and "substring2" are simply being replaced with themselves.

Now, if we add the regex wildcards back in, we get:

   id Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3
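For comparison, the loop from the question can also be made to work without regular expressions. The ValueError occurs because .str.contains(s) returns a boolean Series with one value per row, which an if statement cannot reduce to a single True or False. The sketch below is an alternative, not the approach used in this answer: it assumes the keys are literal text and uses the boolean Series as a row mask with .loc.

```python
import pandas as pd

# Three-row excerpt of the sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 7],
    "Long String": [
        "This is a long substring1 of text that has lots of words",
        "This is substring2 and also contains more text than needed",
        "Within this string is a short substring that is unique",
    ],
})

replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

for needle, short in replacement_dict.items():
    # Boolean Series: True for every row whose text contains the key.
    mask = df["Long String"].str.contains(needle, regex=False)
    # Overwrite only the matching rows with the short value.
    df.loc[mask, "Long String"] = short

print(df["Long String"].tolist())  # ['substring1', 'substring2', 'substring3']
```

One thing to keep in mind with this variant: later keys are tested against values that earlier iterations may already have replaced, so the order of the dictionary can matter if a short value contains another key.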
