<h2>In short:</h2>
<blockquote>
<h3><code>regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4)</code></h3>
<p>This expression extracts the <strong><em>employee name</em></strong> from <strong>any position</strong> in the text column (<code>col('Notes')</code>) where it follows <strong>by</strong> and one or more <strong>whitespace characters</strong>.</p>
</blockquote>
<hr/>
<h2>Detailed explanation:</h2>
<p>Create a sample DataFrame:</p>
<pre><code>data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),
('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
</code></pre>
<p>Import the required functions:</p>
<pre><code>from pyspark.sql.functions import regexp_extract, col
</code></pre>
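<p>On <code>df</code>, use <code>regexp_extract(column_name, regex, group_number)</code> to extract the <code>Employee</code> name from the column.</p>
<p>Here the regex means:</p>
<ul>
<li><em>(.)</em> - any single character (except a newline)</li>
<li><em>(by)</em> - the literal word <strong>by</strong> in the text</li>
<li><em>(\s+)</em> - one or more whitespace characters</li>
<li><em>(\w+)</em> - one or more alphanumeric or underscore characters</li>
</ul>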
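<p>To see how the four groups are numbered, here is a minimal sketch using Python's built-in <code>re</code> module rather than Spark. This is an assumption made for illustration: Spark evaluates the pattern with Java's regex engine, but for this simple pattern the two should behave the same. The variable names are hypothetical.</p>

```python
import re

# Same pattern as in the answer; a raw string keeps the backslashes literal.
PATTERN = re.compile(r'(.)(by)(\s+)(\w+)')

note = 'Double Checked on 2/23/17 by Marsha'
match = PATTERN.search(note)

# Groups are numbered left to right by their opening parenthesis.
groups = {n: match.group(n) for n in range(1, 5)}
print(groups)  # {1: ' ', 2: 'by', 3: ' ', 4: 'Marsha'}
```

<p>Group 1 is the single character before <strong>by</strong>, group 2 is <strong>by</strong> itself, group 3 is the whitespace, and group 4 is the name.</p>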
<p>Because the <code>(\w+)</code> group is the fourth group in the expression, the <strong>group number</strong> is 4.</p>
<pre><code>result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
</code></pre>
<p><a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5537430417240233/3957787748072785/3506802399907740/latest.html" rel="noreferrer">Databricks notebook</a></p>
<h2>Note:</h2>
<blockquote>
<p><code>regexp_extract(col('Notes'), '.by\s+(\w+)', 1)</code> seems like a much cleaner version; you can <a href="https://regex101.com/r/2lk6eV/3" rel="noreferrer">check the regex in use here</a>.</p>
</blockquote>
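<p>As a quick sanity check, the sketch below compares the four-group pattern with the single-group one on the sample notes, again using Python's <code>re</code> module as a stand-in for Spark. The <code>extract</code> helper and the last row are hypothetical. The helper only approximates <code>regexp_extract</code>, which returns an empty string when the regex does not match.</p>

```python
import re

notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
    'No reviewer recorded',  # hypothetical row, not in the original data
]

def extract(pattern, text, group):
    """Rough stand-in for regexp_extract: '' when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(group) if m else ''

# Four capture groups, name in group 4.
four_groups = [extract(r'(.)(by)(\s+)(\w+)', n, 4) for n in notes]
# One capture group, name in group 1.
one_group = [extract(r'.by\s+(\w+)', n, 1) for n in notes]

print(four_groups)
print(one_group)
```

<p>Both lists come out the same on this data, which suggests the extra capture groups are not needed when only the name is wanted.</p>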