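<p>I need a function that, given a list of strings and a larger string, returns the indices in the larger string at which the list items best match.</p>
<p>For example:</p>
<p>Given the string:</p>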
<pre><code>text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
</code></pre>
<p>and the list of strings:</p>
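<pre><code>tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
</code></pre>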
<p>Is it possible to write a function that produces:</p>
<pre><code>indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
</code></pre>
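<p>To make the target concrete: each value appears to be the character offset at which the second and every later token begins in <code>text</code>, so slicing <code>text</code> at those offsets should recover the tokens. The following is only a quick check of that reading; the names <code>tokens</code>, <code>starts</code> and <code>recovered</code> are illustrative, and the token list is the one used in the script further down.</p>

```python
text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tokens = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.',
          'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated',
          'channels', '.']
indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]

# The first token starts at offset 0, so prepend 0 to get one start per token.
starts = [0] + indices

# Slice the text at each start offset, using the token's own length.
recovered = [text[s:s + len(t)] for s, t in zip(starts, tokens)]

print(recovered == tokens)  # True
```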
<hr/>
<p>Here is a script I wrote to illustrate this:</p>
<pre><code>from re import split
from numpy import vstack, zeros
import numpy as np
# I need a function which takes a string and the tokenized list
# and returns the indices at which the string was split into tokens
def index_of_split(text_str, list_of_strings):
    # ?????
    return indices
# The text string, string token list, and character binary annotations
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list
tokenized_strings = np.vstack((tok, bin_ann_toked)).T
# Then we can remove the trailing zeros,
# which are likely caused by spaces
# or other non-tokenized text
for i, el in enumerate(tokenized_strings):
    tokenized_strings[i][1] = el[1][:len(el[0])]
print(tokenized_strings)
</code></pre>
<p>If the function worked as described, this <em>would</em> produce the following output:</p>
<pre><code>[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['a' array([0])]
 ['inwardly-rectifying'
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['channel' array([0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]
 ['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['useful' array([0, 0, 0, 0, 0, 0])]
 ['in' array([0, 0])]
 ['glucose' array([1, 1, 1, 1, 1, 1, 1])]
 ['-' array([0])]
 ['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]]
</code></pre>
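<p>For reference, here is a minimal sketch of the kind of function I have in mind, not a finished solution. It assumes the tokens occur in <code>text</code> in the same order as in the list, and it searches for each token with <code>str.find</code>, starting from the end of the previous match. The variable names <code>starts</code> and <code>cursor</code> and the error handling are my own illustrative choices.</p>

```python
def index_of_split(text_str, list_of_strings):
    """Return the start offset of every token except the first.

    Assumes the tokens appear in text_str in list order.
    """
    starts = []
    cursor = 0  # search resumes just past the previous match
    for token in list_of_strings:
        pos = text_str.find(token, cursor)
        if pos == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        starts.append(pos)
        cursor = pos + len(token)
    # Drop the first start so the result can serve as a list of split points.
    return starts[1:]


text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.',
       'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated',
       'channels', '.']

indices = index_of_split(text, tok)
print(indices)
# [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
```

<p>This greedy left-to-right search could misalign if a token's text also occurs earlier, inside material the tokenizer dropped, so it is probably only safe when the token list covers the text closely, as it does here.</p>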