Python Pandas substrings

1 vote

1 answer

7602 views

Asked 2025-04-17 20:41

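I have a pandas DataFrame in which one column holds strings. The DataFrame has more than 2 million rows, so extracting the pieces I need row by row is far too slow. My current code is: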

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

Here "series_id" is a string that packs together several fields I want to extract. Here is a sample of the data:

Columns:

 [series_id, year, month, value, footnotes]

Data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

The column I'm stuck on is "series_id". I have looked at the str.FUNCTION methods in Python and pandas:

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

That page has a section covering each string function, including get and slice, which are the ones I want to use. Ideally I'd like a solution along these lines:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

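However, each of these attempts fails with a syntax error at the ":".

Unfortunately, I still haven't found the right way to take substrings of a column in a pandas DataFrame.

Thanks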


1 Answer

I think you can solve this with str.extract and a regular expression (which you can adapt to your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

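This pattern says: the string starts with any two characters (which are ignored), the next three characters are the state_code, then comes one ignored character, followed by four digits, which are the area_code, ...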
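To show how this could be applied to the whole column at once, here is a minimal sketch. It assumes a small stand-in DataFrame named table built from two of the series_id values on this page; the variable names pattern, codes, and sliced are only for illustration. It also shows positional slicing with .str[start:stop] and .str.slice(start, stop), which is likely what the question was reaching for. Note that the offsets in the question's loop and the offsets in the regex above are not the same, so pick whichever matches your field layout.

```python
import pandas as pd

# Stand-in for the real 2-million-row table (assumed shape, values taken from this page).
table = pd.DataFrame({
    "series_id": ["SMS01000000000000001", "SMU78000009092000001"],
    "year": ["2006", "2006"],
})

# Pattern from the answer, as a raw string so that \d reaches the regex engine unchanged.
pattern = r"^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})"

# str.extract returns a DataFrame with one column per named group;
# join attaches those columns to the original rows by index.
codes = table["series_id"].str.extract(pattern)
table = table.join(codes)

# Fixed-width alternative: vectorized positional slices, using the
# offsets from the loop in the question ([2:4], [5:9], [11:12]).
sliced = pd.DataFrame({
    "state_code": table["series_id"].str[2:4],
    "area_code": table["series_id"].str.slice(5, 9),
    "supersector_code": table["series_id"].str.slice(11, 12),
})
```

Both approaches operate on the whole column without a Python-level loop. The syntax error in the question comes from writing 1:3 inside a call: slice notation is only valid inside square brackets, so the bounds go in as separate arguments to .str.slice or inside .str[...].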

Answered 2025-04-17 by Python大师
