I'd like to replace the long strings in a DataFrame with much shorter strings. I have a short dictionary of the replacements I want to make.
import pandas as pd
from StringIO import StringIO
replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}
exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
3;This is a long substring1 of text that has lots of words
4;This is substring2 and also contains more text than needed
5;This is substring2 and also contains more text than needed
6;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
8;This is a long substring1 of text that has lots of words
9;Within this string is a short substring that is unique
10;Within this string is a short substring that is unique
""")
df = pd.read_csv(exampledata, sep=";")
print df
for s in replacement_dict.keys():
    if df['Long String'].str.contains(s):
        df['Long String'] = replacement_dict[df['Long String'].str.contains(s)]
The expected DataFrame looks like this:
   id Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3
When I run the code above, I get the following error:
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    if df['Long String'].str.contains(s):
  File "h:\Anaconda\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I replace the various long strings with the shorter strings across the whole DataFrame?
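You can use pandas' .replace() for this kind of thing. However, you have to modify your dictionary slightly to get the result you want.
What I did was turn each key into a regex string by wrapping it in .* on both sides, for example .*substring1.* instead of substring1. The pattern then matches everything before and after the substring you are looking for. This is important.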
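To see why the wildcards matter, here is a minimal sketch using only Python's `re` module. The dictionary literal is a reconstruction of the modified keys described above, not the answer's original listing, and the sample string is row 7 from the question.

```python
import re

# Keys wrapped in ".*" so each pattern matches the whole cell,
# not just the substring (reconstructed for illustration).
replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

sample = "Within this string is a short substring that is unique"

# Without wildcards, only the matched fragment is swapped out.
partial = re.sub("a short substring", "substring3", sample)

# With wildcards, the pattern consumes the entire string,
# so the whole value is replaced.
whole = re.sub(".*a short substring.*", "substring3", sample)

print(partial)  # Within this string is substring3 that is unique
print(whole)    # substring3
```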
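Next, replace the entire for loop with a single call to .replace() on the 'Long String' column, passing the dictionary and regex=True.
.replace() can take a dictionary where the keys are the strings to match and the values are the replacement text. The reason for changing the keys to capture everything before and after the substring is that we can now replace the whole value, not just the small matched string. For example, with a dictionary that lacks the .* parts, the only change you would actually see is to the "a short substring" values, because "substring1" and "substring2" are simply being replaced with themselves.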
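A small sketch of that no-wildcard case, assuming pandas is installed. The three-row frame is a hypothetical cut-down of the question's data, and `plain_dict` is just the question's original dictionary under a new name.

```python
import pandas as pd

# Hypothetical three-row sample taken from the question's data.
df = pd.DataFrame({
    "id": [1, 2, 7],
    "Long String": [
        "This is a long substring1 of text that has lots of words",
        "This is substring2 and also contains more text than needed",
        "Within this string is a short substring that is unique",
    ],
})

# The question's dictionary, with no ".*" around the keys.
plain_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

# regex=True makes .replace() treat the keys as patterns and
# substitute only the part of each string that matched.
result = df["Long String"].replace(plain_dict, regex=True)

for value in result:
    print(value)
```

Only the last row changes (to "Within this string is substring3 that is unique"). The first two rows keep their full text.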
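Now, if we add the regex wildcards back in, every value collapses to its short string, which is the expected output shown in the question.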
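Putting it together, the following sketch runs the wildcard dictionary against the question's full sample data. It assumes Python 3 (hence `io.StringIO` and `print()` rather than the question's Python 2 forms), and the dictionary is again a reconstruction of the keys described in the answer.

```python
from io import StringIO

import pandas as pd

# Reconstructed wildcard dictionary: each key matches the whole cell.
replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
3;This is a long substring1 of text that has lots of words
4;This is substring2 and also contains more text than needed
5;This is substring2 and also contains more text than needed
6;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
8;This is a long substring1 of text that has lots of words
9;Within this string is a short substring that is unique
10;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")

# One vectorized call replaces the whole for loop from the question.
df["Long String"] = df["Long String"].replace(replacement_dict, regex=True)

print(df)
```

Because the replacement is a single column-wide operation, no `if` on a Series is needed, which also sidesteps the "truth value of a Series is ambiguous" error from the question.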