How do I efficiently chunk a UTF-8 string for REST transmission in Python?

  1. Let me preface this by saying that I sort of understand what 'UTF-8' encoding is: it is basically a form of Unicode, but not exactly the same thing, while ASCII is a smaller character set. I also understand that if I have:

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)              #will return the number of characters in the string, in my case '1500'
    print sys.getsizeof(se_body)    #will return the number of bytes, which will be 3050
    
  2. My code is consuming a RESTful API that I don't control. The job of the API is to parse incoming Bible references out of text, and it has an interesting quirk: it can only accept 2000 characters at a time. If more than 2000 characters are sent, my API call returns a 404 error. Again, I'm consuming someone else's API, so please don't tell me to "fix the server side". I can't :)

  3. My solution is to break the string into chunks that are each under 2000 characters, have the API scan them one at a time, and then reassemble and tag them as needed. I'd like to be kind to the service and pass as few chunks as possible, which means each chunk should be as large as possible.

  4. My problem arises when I pass strings containing Hebrew or Greek characters. (Yes, Bible answers frequently involve Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always pass safely, but that seems awfully small. In most cases I should be able to make the chunks larger.

  5. My question is: without doing anything too convoluted, how can I efficiently chunk a UTF-8 string to an appropriate size?

Here is the code:

# -*- coding: utf-8 -*-
import requests
import json

biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"

se_body = se_body.decode('utf-8')

nchunk_start=0
nchunk_size=1500
found_refs = []

while nchunk_start < len(se_body):
    body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]
    if (len(body_chunk.strip())<4):
        break

    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)

    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref ) 
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print "  returned text is: =>{0}<=".format(refparse.text)

    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks


for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]

I know how to slice the string (body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]), but I'm not sure how to slice the same string according to its length in UTF-8 bytes.

Once that's done, I need to pull out the selected references (I'll actually be adding SPAN tags). But for now the output looks something like this:

{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9

1 Answer


There are several different notions of "size" here (a short illustration of how they diverge on Hebrew text follows this list):

  1. The size in memory, as returned by sys.getsizeof(), e.g.:

    >>> import sys
    >>> sys.getsizeof(b'a')
    38
    >>> sys.getsizeof(u'Α')
    56
    

    That is, a bytestring containing the single byte b'a' may require 38 bytes of memory.
    Unless your machine is running into memory problems, you don't need to care about this.

  2. The number of bytes in the text once it is encoded as utf-8:

    >>> unicode_text = u'Α' # greek letter
    >>> bytestring = unicode_text.encode('utf-8')
    >>> len(bytestring)
    2
    
  3. The number of Unicode codepoints in the text:

    >>> unicode_text = u'Α' # greek letter
    >>> len(unicode_text)
    1
    

    In general, you might also be interested in the number of grapheme clusters ("user-perceived characters") in the text:

    >>> unicode_text = u'\u0435\u0308' # cyrillic 'е' followed by a combining diaeresis
    >>> len(unicode_text) # number of Unicode codepoints
    2
    >>> import regex # $ pip install regex
    >>> chars = regex.findall(u'\\X', unicode_text)
    >>> chars
    [u'\u0435\u0308']
    >>> len(chars) # number of "user-perceived characters"
    1
    

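To connect this to the question: for Hebrew (or Greek) text, counts 2 and 3 diverge, because every codepoint in those scripts takes two bytes in utf-8. A quick illustration, using explicit escapes for a pointed Hebrew word like the ones in the question body:

>>> hebrew_word = u'\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea' # a pointed Hebrew word
>>> len(hebrew_word)                   # Unicode codepoints
6
>>> len(hebrew_word.encode('utf-8'))   # utf-8 bytes
12
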
If the API's limit is defined by point 2 (the number of bytes in the utf-8 encoded bytestring), then you could use the answers from the question @Martijn Pieters pointed to: Truncating unicode so that it fits a maximum size when encoded for wire transfer. The first answer should work:

truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
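The asker needs every chunk, not just the first one, so here's a minimal sketch (my own helper, not from the linked answer) that extends the same encode/slice/decode trick into a generator: each piece is at most max_bytes long once utf-8 encoded, and no multi-byte character is ever split. Note it can still separate a base letter from its combining marks at a chunk boundary (the grapheme-cluster issue above); the 50-character overlap in the asker's loop is one way to paper over such seams.

def chunk_utf8(unicode_text, max_bytes):
    """Yield unicode chunks, each at most max_bytes long when utf-8 encoded,
    without splitting any multi-byte character."""
    encoded = unicode_text.encode('utf-8')
    start = 0
    while start < len(encoded):
        # 'ignore' drops the trailing bytes of a codepoint cut in half
        chunk = encoded[start:start + max_bytes].decode('utf-8', 'ignore')
        if not chunk:  # max_bytes is smaller than a single codepoint
            break
        yield chunk
        # advance by the bytes actually consumed, so nothing is dropped
        start += len(chunk.encode('utf-8'))

for piece in chunk_utf8(se_body, 2000):
    print len(piece.encode('utf-8'))  # always <= 2000
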

It's also possible that the length is limited by the length of the URL:

>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'

To truncate it:

import re
import urllib

urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded) 
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
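
If you want to know how long the final URL will be before sending anything, requests can build a request without dispatching it (requests.Request and its .prepare() method; this assumes a reasonably recent version of requests):

import requests

req = requests.Request('GET', refparser_url,
                       params={'text': body_chunk, 'key': biblia_apikey})
prepared = req.prepare()
print len(prepared.url)  # length of the fully percent-encoded URL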

The URL-length issue could be addressed with the 'X-HTTP-Method-Override' HTTP header, which lets you convert the GET request into a POST request, provided the service supports it. Here's a code example that uses the Google Translate API.
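
As a hypothetical sketch of what that would look like with requests, assuming the biblia.com service honors the header (it may not):

# POST with the override header: the parameters travel in the request
# body, so the URL itself stays short
refparse = requests.post(refparser_url,
                         data={'text': body_chunk, 'key': biblia_apikey},
                         headers={'X-HTTP-Method-Override': 'GET'})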

If it's acceptable in your case, you can compress the HTML text by unescaping the HTML character references and applying the NFC Unicode normalization form, which composes some sequences of codepoints:

import unicodedata
from HTMLParser import HTMLParser

unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
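
A quick REPL illustration of the saving, using a fragment from the question body (&quot; is six ASCII characters that unescape to a single quotation mark):

>>> from HTMLParser import HTMLParser
>>> import unicodedata
>>> raw = u'&quot;rest&quot;'
>>> cleaned = unicodedata.normalize('NFC', HTMLParser().unescape(raw))
>>> len(raw), len(cleaned)
(16, 6)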
