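<h2>Option 1: Split on the comma</h2>
<p>Could it be as simple as splitting the string on the comma, then taking the last token before the comma and the first token after it?</p>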
<pre><code>addresses = ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758"]
for a in addresses:
    asplit = a.split(",")
    city = asplit[0].split()[-1]   # last token before the comma
    state = asplit[1].split()[0]   # first token after the comma
    print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
</code></pre>
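<p>The loop above raises an <code>IndexError</code> if an address contains no comma. A minimal sketch of the same idea wrapped in a function, assuming you would rather get <code>None</code> values in that case. The name <code>split_city_state</code> is hypothetical and not part of the original answer:</p>

```python
def split_city_state(address):
    """Return (city, state) from 'street City, ST ...', or (None, None)."""
    # partition() splits on the first comma and never raises.
    before, sep, after = address.partition(",")
    before_tokens = before.split()
    after_tokens = after.split()
    # No comma, or nothing on one side of it: nothing to extract.
    if not sep or not before_tokens or not after_tokens:
        return None, None
    return before_tokens[-1], after_tokens[0]


for a in ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758", "no comma here"]:
    print(split_city_state(a))
```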
<hr/>
<p><strong>Example</strong></p>
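<p>If you have the following DataFrame:</p>
<pre><code>import pandas as pd

# Sample data matching the output shown further down
df = pd.DataFrame({
    "Address": [
        "xxx Richardson, TX",
        "yyy Plano, TX",
        "xxyy Wylie, TX WO-65758",
        "zzz Waxahachie, TX WO-999786",
    ]
})
</code></pre>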
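<p>You can define a function that does the split:</p>
<pre><code>def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    return city, state
</code></pre>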
<p>Then <code>apply()</code> the function to the <code>Address</code> column, which returns two new columns, and <code>join()</code> them back onto the original DataFrame:</p>
<pre><code>df.join(
    df['Address'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)
#                        Address        City  State
#0            xxx Richardson, TX  Richardson     TX
#1                 yyy Plano, TX       Plano     TX
#2       xxyy Wylie, TX WO-65758       Wylie     TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie     TX
</code></pre>
<hr/>
<h2>Option 2: Use a regular expression</h2>
<p>If that doesn't work, how about matching with a regex pattern?</p>
<p>This should work:</p>
<pre><code>import re
pattern = r"[A-Z][a-z]+,\s[A-Z]{2}"
# `addresses` is the list defined in Option 1
for a in addresses:
    matches = re.finditer(pattern, a, re.MULTILINE)
    for match in matches:
        city, state = match.group().replace(",", "").split()
        print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
</code></pre>
<p>The pattern matches:</p>
<ul>
<li><code>[A-Z]</code>: one uppercase letter</li>
<li><code>[a-z]+</code>: one or more lowercase letters</li>
<li><code>,\s</code>: a comma followed by a whitespace character</li>
<li><code>[A-Z]{2}</code>: exactly two uppercase letters</li>
</ul>
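<p>To see how those pieces combine, here is a small sketch that uses the same pattern with named groups and <code>re.search</code>. The helper name <code>find_city_state</code> and the group names are illustrative assumptions, not part of the original answer:</p>

```python
import re

# Same pattern as above, with named groups added for readability.
pattern = re.compile(r"(?P<city>[A-Z][a-z]+),\s(?P<state>[A-Z]{2})")


def find_city_state(address):
    """Return (city, state) for the first match, or None if nothing matches."""
    match = pattern.search(address)
    if match is None:
        return None
    return match.group("city"), match.group("state")


print(find_city_state("xxyy Wylie, TX WO-65758"))  # ('Wylie', 'TX')
print(find_city_state("no match here"))            # None
```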
<p><a href="https://regex101.com/r/p5S7tv/2" rel="nofollow noreferrer">Demo on Regex101</a></p>
<hr/>
<p><strong>Example</strong></p>
<pre><code>df.join(
    df['Address'].str.extract(
        r"((?P<City>[A-Z][a-z]+),\s(?P<State>[A-Z]{2}))",
        expand=False
    )[["City", "State"]]
)
#                        Address        City  State
#0            xxx Richardson, TX  Richardson     TX
#1                 yyy Plano, TX       Plano     TX
#2       xxyy Wylie, TX WO-65758       Wylie     TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie     TX
</code></pre>
<p><strong>Caveats</strong></p>
<ul>
<li>This does not work for city names that contain spaces, e.g. "San Antonio, TX".</li>
</ul>
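<p>The sketch below illustrates that caveat and one possible relaxation. The <code>multi_word</code> pattern is a hypothetical variant, not part of the original answer, and it assumes that the street part of the address does not end in capitalized words:</p>

```python
import re

# Pattern from the answer, with capture groups around city and state.
single_word = re.compile(r"([A-Z][a-z]+),\s([A-Z]{2})")
# Hypothetical relaxation: allow several capitalized words before the comma.
multi_word = re.compile(r"((?:[A-Z][a-z]+\s)*[A-Z][a-z]+),\s([A-Z]{2})")

address = "xxx San Antonio, TX"
print(single_word.search(address).groups())  # ('Antonio', 'TX')
print(multi_word.search(address).groups())   # ('San Antonio', 'TX')
```

<p>The relaxed pattern is greedy: for an address such as "12 Main St San Antonio, TX" it would capture "Main St San Antonio" as the city, so it only helps when the text before the city is not capitalized.</p>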