<p>Simply:</p>
<pre><code>df['word'] = df['word'].str.strip()
</code></pre>
<p>It should remove all <code>spaces</code>, <code>tabs</code> and <code>new lines</code> from both sides of the text.</p>
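<p>A minimal sketch of what that does, using a small made-up DataFrame (the sample values are only an assumption for illustration):</p>

```python
import pandas as pd

# Hypothetical sample data with leading/trailing spaces, tabs and newlines
df = pd.DataFrame({'word': ['  water', 'surf\t', '\nwind surf \n']})

# Remove whitespace on both sides of every value
df['word'] = df['word'].str.strip()

# Whitespace inside a value ('wind surf') is left untouched
print(df['word'].tolist())
```

<p>Only the outer whitespace is removed; anything between words stays as it is.</p>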
<hr/>
<p><strong>BTW:</strong></p>
<p>You can probably even use just <code>split(";")</code> without <code>split("; ")</code>, <code>split(" ;")</code>, etc., because <code>strip()</code> will remove those spaces.</p>
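<p>For example, a rough sketch with one made-up string that mixes all the spacing variants around <code>;</code>:</p>

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around ';'
s = pd.Series(['mountain;wind; sun ; sea'])

# Split on the bare ';' and let strip() clean up the leftover spaces
words = s.str.split(';').explode().str.strip()

print(words.tolist())
```

<p>A single-character pattern like <code>';'</code> is treated as plain text, so no regex is involved here.</p>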
<hr/>
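<p>If you want to use variants like <code>split(";")</code>, <code>split("; ")</code>, <code>split(" ;")</code>, <code>split(" ; ")</code>, then you should start with the longest, <code>split(" ; ")</code>, continue with the shorter <code>split("; ")</code> and <code>split(" ;")</code>, and finish with the shortest, <code>split(";")</code>. That way the spaces should be removed together with the separator.</p>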
<hr/>
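<p>You can even try a single <code>split('[;,./-]')</code> instead of all the separate <code>split()</code> calls. Put <code>-</code> last inside <code>[]</code>: written as <code>[;,-./]</code>, the part <code>,-.</code> is read as a character range (it happens to cover the same characters, but it is easy to misread). Note that this splits on every <code>.</code> and <code>-</code>, so values like <code>web3.0</code> and <code>e-learning</code> get split too.</p>
<pre><code># `words` is a Series with one word per row, not the full DataFrame
words = df['word'].str.split(r'[;,./-]').explode().str.strip()
</code></pre>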
<p>You can also use <code>|</code> as <code>OR</code> inside the pattern.</p>
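<p>The longest-first ordering and <code>|</code> can be combined. Below is a small sketch with the standard <code>re</code> module; the delimiter list and the sample text are made up for illustration. In an alternation the first alternative that matches at a position wins, which is why the order matters:</p>

```python
import re

# Hypothetical delimiter list, deliberately given in mixed order
delimiters = [';', ' ; ', '; ', ' ;']

# Longest first, so ' ; ' and '; ' are tried before the bare ';'
ordered = sorted(delimiters, key=len, reverse=True)
pattern = '|'.join(re.escape(d) for d in ordered)

text = 'a ; b;c; d ;e'
good = re.split(pattern, text)
print(good)

# With the bare ';' first it can win over '; ' and leave a space behind
naive = '|'.join(re.escape(d) for d in delimiters)
bad = re.split(naive, text)
print(bad)
```

<p>The first result has clean words, while the second one still has a leading space on one item.</p>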
<hr/>
<p><strong>EDIT:</strong></p>
<p>Minimal working example with the data directly in the code, so everyone can test it.</p>
<pre><code>import pandas as pd
import io
text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine"
Sergi,"runner . athletics"'''
# text to dataframe
df = pd.read_csv(io.StringIO(text))
# raw string, so `\.` is passed to the regex engine without an invalid-escape warning
df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()
# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read()
print(text)
</code></pre>
<p>Result:</p>
<pre><code>name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Sergi,runner
Sergi,athletics
</code></pre>
<hr/>
<p><strong>EDIT:</strong></p>
<p>The same without <code>strip()</code>.</p>
<p>I use <code>' ?'</code> to match an optional <code>space</code> after the chars <code>;,/</code> and before the char <code>.</code></p>
<p>I also put <code>' - '</code> before <code>'- '</code> and <code>' -'</code> so that the longest version is tried first.</p>
<pre><code>df['word'] = df['word'].str.split('[;,/] ?| ?\. | - |- | -')
df = df.explode('word')
</code></pre>
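<p>A quick way to check that both patterns give the same words is to compare them on the sample strings. This sketch uses plain <code>re.split</code> instead of pandas, so it only checks the patterns themselves:</p>

```python
import re

# The `word` values from the example data above
samples = [
    'water,surf,windsurf',
    'football, striker, ball',
    'mountain;wind;sun',
    'basketball; nba; ball',
    'informatic/web3.0/e-learning',
    'doctor - medicine',
    'runner . athletics',
]

with_strip = r'[;,/]|\. |- | -'
without_strip = r'[;,/] ?| ?\. | - |- | -'

results = []
for text in samples:
    # First pattern needs strip() afterwards, second one does not
    a = [part.strip() for part in re.split(with_strip, text)]
    b = re.split(without_strip, text)
    results.append(a == b)
    print(a == b, b)
```

<p>Every line should start with <code>True</code>. For other data the two patterns may differ, for example when there is more than one space around a separator.</p>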
<hr/>
<p><strong>EDIT:</strong></p>
<p>Example which uses replacement to keep <code>(data, science)</code> as a single string, without splitting it.</p>
<pre><code>import pandas as pd
import io
text = '''name,word
Oliver,"water,surf,windsurf"
Tom,"football, striker, ball"
Anna,"mountain;wind;sun"
Sara,"basketball; nba; ball; (date1, time1)"
Mark,"informatic/web3.0/e-learning"
Christian,"doctor - medicine - (date2, time2) - date3, time3"
Sergi,"runner . athletics"'''
# text to dataframe
df = pd.read_csv(io.StringIO(text))
# Find all `(...)`
found = df['word'].str.findall(r'\(.*?\)')
print(found)
# Flatten it
found = sum(found, [])
print(found)
# Create dict to put pattern in place of `(...)`.
# Because `regex=True` is used later, the keys need `\(...\)` instead of `(...)`
patterns = {rf'\({value[1:-1]}\)': f'XXX{i}' for i, value in enumerate(found)}
print(patterns)
df['word'] = df['word'].replace(patterns, regex=True)
# - normal splitting -
df['word'] = df['word'].str.split(r'[;,/]|\. |- | -')
df = df.explode('word')
df['word'] = df['word'].str.strip()
# Create dict to put later `(...)` in place of pattern.
patterns_back = {f'XXX{i}':value for i, value in enumerate(found)}
print(patterns_back)
df['word'] = df['word'].replace(patterns_back, regex=True)
# dataframe to text
output = io.StringIO()
df.to_csv(output, index=False)
output.seek(0)
text = output.read()
print(text)
</code></pre>
<p>Result:</p>
<pre><code>0 []
1 []
2 []
3 [(date1, time1)]
4 []
5 [(date2, time2)]
6 []
Name: word, dtype: object
['(date1, time1)', '(date2, time2)']
{'\\(date1, time1\\)': 'XXX0', '\\(date2, time2\\)': 'XXX1'}
{'XXX0': '(date1, time1)', 'XXX1': '(date2, time2)'}
name,word
Oliver,water
Oliver,surf
Oliver,windsurf
Tom,football
Tom,striker
Tom,ball
Anna,mountain
Anna,wind
Anna,sun
Sara,basketball
Sara,nba
Sara,ball
Sara,"(date1, time1)"
Mark,informatic
Mark,web3.0
Mark,e-learning
Christian,doctor
Christian,medicine
Christian,"(date2, time2)"
Christian,date3
Christian,time3
Sergi,runner
Sergi,athletics
</code></pre>
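<p>As a possible alternative to the placeholders (a different technique, not what the code above does), a negative lookahead can skip separators that sit inside <code>(...)</code>. This is only a sketch with plain <code>re.split</code>, and it assumes the parentheses are balanced and never nested:</p>

```python
import re

# Same separators as before, but a match is rejected when a ')' follows
# before any other parenthesis, i.e. when the separator is inside `(...)`.
# Assumption: parentheses are balanced and not nested.
pattern = r'(?:[;,/]|\. |- | -)(?![^()]*\))'

text = 'doctor - medicine - (date2, time2) - date3, time3'

parts = [part.strip() for part in re.split(pattern, text)]
print(parts)
```

<p>For this row it gives the same five items as the placeholder version. I have only checked it with <code>re</code>; whether the lookahead works in <code>str.split()</code> may depend on the string backend used by pandas.</p>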