How do I efficiently chunk a UTF-8 string for REST transmission in Python?

  1. Let me preface this by saying that I sort of understand what 'UTF-8' encoding is: it is basically a form of Unicode, but not exactly the same thing, while ASCII is a smaller character set. I also understand that if I have:

    se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word tr <excess removed ...> JV"
    print len(se_body)              #will return the number of characters in the string, in my case '1500'
    print sys.getsizeof(se_body)    #will return the number of bytes, which will be 3050
    
  2. My code is consuming a RESTful API that I don't control. The job of the API is to parse incoming Bible references out of text, and it has an interesting quirk: it can only accept 2000 characters at a time. If more than 2000 characters are sent, my API call returns a 404 error. Again, I'm consuming someone else's API, so please don't tell me to "fix the server side". I can't :)

  3. My solution is to break the string into chunks that are each under 2000 characters, have the API scan them one at a time, and then reassemble and tag them as needed. I'd like to be kind to the service and pass as few chunks as possible, which means each chunk should be as large as possible.

  4. My problem arises when I pass strings containing Hebrew or Greek characters. (Yes, Bible answers frequently involve Greek and Hebrew!) If I set the chunk size as low as 1000 characters, I can always pass safely, but that seems awfully small. In most cases I should be able to make the chunks larger.

  5. My question is: without doing anything too convoluted, how can I efficiently chunk a UTF-8 string to an appropriate size?

Here is the code:

# -*- coding: utf-8 -*-
import requests
import json

biblia_apikey = '************'
refparser_url = "http://api.biblia.com/v1/bible/scan/?"
se_body = "&gt; Genesis 2:2 וַיְכַל אֱלֹהִים בַּיֹּום הַשְּׁבִיעִי מְלַאכְתֹּו אֲשֶׁר עָשָׂה וַיִּשְׁבֹּת בַּיֹּום הַשְּׁבִיעִי מִכָּל־מְלַאכְתֹּו אֲשֶׁר עָשָֽׂה׃ The word translated as &quot;rest&quot; in English, is actually the conjugated word from which we get the English word `Sabbath`, which actually means to &quot;cease doing&quot;. &gt; וַיִּשְׁבֹּת or by its root: &gt; שָׁבַת Here&#39;s BlueletterBible&#39;s concordance entry: [Strong&#39;s H7673][1] It is actually the same root word that is conjugated to mean &quot;[to go on strike][2]&quot; in modern Hebrew. In Genesis it is used to refer to the fact that the creation process ceased, not that God &quot;rested&quot; in the sense of relieving exhaustion, as we would normally understand the term in English. The word &quot;rest&quot; in that sense is &gt; נוּחַ Which can be found in Genesis 8:9, for example (and is also where we get Noah&#39;s name). More here: [Strong&#39;s H5117][3] Jesus&#39; words are in reference to the fact that God is always at work, as the psalmist says in Psalm 54:4, He is the sustainer, something that implies a constant intervention (a &quot;work&quot; that does not cease). The institution of the Sabbath was not merely just so the Israelites would &quot;rest&quot; from their work but as with everything God institutes in the Bible, it had important theological significance (especially as can be gleaned from its prominence as one of the 10 commandments). The point of the Sabbath was to teach man that he should not think he is self-reliant (cf. instances such as Judges 7) and that instead they should rely upon God, but more specifically His mercy. The driving message throughout the Old Testament as well as the New (and this would be best extrapolated in c.se) is that man cannot, by his own efforts (&quot;works&quot;) reach God&#39;s standard: &gt; Ephesians 2:8 For by grace you have been saved through faith, and that not of yourselves; it is the gift of God, 9 not of works, lest anyone should boast. The Sabbath (and the penalty associated with breaking it) was a way for the Israelites to weekly remember this. See Hebrews 4 for a more in depth explanation of this concept. So there is no contradiction, since God never stopped &quot;working&quot;, being constantly active in sustaining His creation, and as Jesus also taught, the Sabbath was instituted for man, to rest, but also, to &quot;stop doing&quot; and remember that he is not self-reliant, whether for food, or for salvation. Hope that helps. [1]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H7673&amp;t=KJV [2]: http://www.morfix.co.il/%D7%A9%D7%91%D7%99%D7%AA%D7%94 [3]: http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?strongs=H5117&amp;t=KJV"

se_body = se_body.decode('utf-8')

nchunk_start=0
nchunk_size=1500
found_refs = []

while nchunk_start < len(se_body):
    body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]
    if (len(body_chunk.strip())<4):
        break

    refparser_params = {'text': body_chunk, 'key': biblia_apikey }
    headers = {'content-type': 'text/plain; charset=utf-8', 'Accept-Encoding': 'gzip,deflate,sdch'}
    refparse = requests.get(refparser_url, params = refparser_params, headers=headers)

    if (refparse.status_code == 200):
        foundrefs = json.loads(refparse.text)
        for foundref in foundrefs['results']:
            foundref['textIndex'] += nchunk_start
            found_refs.append( foundref ) 
    else:
        print "Status Code {0}: Failed to retrieve valid parsing info at {1}".format(refparse.status_code, refparse.url)
        print "  returned text is: =>{0}<=".format(refparse.text)

    nchunk_start += (nchunk_size-50)
    #Note: I'm purposely backing up, so that I don't accidentally split a reference across chunks


for ref in found_refs:
    print ref
    print se_body[ref['textIndex']:ref['textIndex']+ref['textLength']]

I know how to slice the string (body_chunk = se_body[nchunk_start:nchunk_start + nchunk_size]), but I'm not sure how to slice the same string according to its length in UTF-8 bytes.

Once that's done, I need to pull out the selected references (I'll actually be adding SPAN tags). But for now the output looks something like this:

{u'textLength': 11, u'textIndex': 5, u'passage': u'Genesis 2:2'}
Genesis 2:2
{u'textLength': 11, u'textIndex': 841, u'passage': u'Genesis 8:9'}
Genesis 8:9

1 Answer


There are several different notions of "size" here (a short illustration of how they diverge on Hebrew text follows this list):

  1. The size in memory, as returned by sys.getsizeof(), e.g.:

    >>> import sys
    >>> sys.getsizeof(b'a')
    38
    >>> sys.getsizeof(u'Α')
    56
    

    That is, a bytestring containing the single byte b'a' may require 38 bytes of memory.
    Unless your machine is running into memory problems, you don't need to care about this.

  2. The number of bytes in the text once it is encoded as utf-8:

    >>> unicode_text = u'Α' # greek letter
    >>> bytestring = unicode_text.encode('utf-8')
    >>> len(bytestring)
    2
    
  3. The number of Unicode codepoints in the text:

    >>> unicode_text = u'Α' # greek letter
    >>> len(unicode_text)
    1
    

    In general, you might also be interested in the number of grapheme clusters ("user-perceived characters") in the text:

    >>> unicode_text = u'\u0435\u0308' # cyrillic 'е' followed by a combining diaeresis
    >>> len(unicode_text) # number of Unicode codepoints
    2
    >>> import regex # $ pip install regex
    >>> chars = regex.findall(u'\\X', unicode_text)
    >>> chars
    [u'\u0435\u0308']
    >>> len(chars) # number of "user-perceived characters"
    1
    

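To connect this to the question: for Hebrew (or Greek) text, counts 2 and 3 diverge, because every codepoint in those scripts takes two bytes in utf-8. A quick illustration, using explicit escapes for a pointed Hebrew word like the ones in the question body:

>>> hebrew_word = u'\u05e9\u05b8\u05c1\u05d1\u05b7\u05ea' # a pointed Hebrew word
>>> len(hebrew_word)                   # Unicode codepoints
6
>>> len(hebrew_word.encode('utf-8'))   # utf-8 bytes
12
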
If the API's limit is defined by point 2 (the number of bytes in the utf-8 encoded bytestring), then you could use the answers from the question @Martijn Pieters pointed to: Truncating unicode so that it fits a maximum size when encoded for wire transfer. The first answer should work:

truncated = unicode_text.encode('utf-8')[:2000].decode('utf-8', 'ignore')
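The asker needs every chunk, not just the first one, so here's a minimal sketch (my own helper, not from the linked answer) that extends the same encode/slice/decode trick into a generator: each piece is at most max_bytes long once utf-8 encoded, and no multi-byte character is ever split. Note it can still separate a base letter from its combining marks at a chunk boundary (the grapheme-cluster issue above); the 50-character overlap in the asker's loop is one way to paper over such seams.

def chunk_utf8(unicode_text, max_bytes):
    """Yield unicode chunks, each at most max_bytes long when utf-8 encoded,
    without splitting any multi-byte character."""
    encoded = unicode_text.encode('utf-8')
    start = 0
    while start < len(encoded):
        # 'ignore' drops the trailing bytes of a codepoint cut in half
        chunk = encoded[start:start + max_bytes].decode('utf-8', 'ignore')
        if not chunk:  # max_bytes is smaller than a single codepoint
            break
        yield chunk
        # advance by the bytes actually consumed, so nothing is dropped
        start += len(chunk.encode('utf-8'))

for piece in chunk_utf8(se_body, 2000):
    print len(piece.encode('utf-8'))  # always <= 2000
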

It's also possible that the length is limited by the length of the URL:

>>> import urllib
>>> urllib.quote(u'\u0435\u0308'.encode('utf-8'))
'%D0%B5%CC%88'

To truncate it:

import re
import urllib

urlencoded = urllib.quote(unicode_text.encode('utf-8'))[:2000]
# remove `%` or `%X` at the end
urlencoded = re.sub(r'%[0-9a-fA-F]?$', '', urlencoded) 
truncated = urllib.unquote(urlencoded).decode('utf-8', 'ignore')
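
If you want to know how long the final URL will be before sending anything, requests can build a request without dispatching it (requests.Request and its .prepare() method; this assumes a reasonably recent version of requests):

import requests

req = requests.Request('GET', refparser_url,
                       params={'text': body_chunk, 'key': biblia_apikey})
prepared = req.prepare()
print len(prepared.url)  # length of the fully percent-encoded URL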

The URL-length issue could be addressed with the 'X-HTTP-Method-Override' HTTP header, which lets you convert the GET request into a POST request, provided the service supports it. Here's a code example that uses the Google Translate API.
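
As a hypothetical sketch of what that would look like with requests, assuming the biblia.com service honors the header (it may not):

# POST with the override header: the parameters travel in the request
# body, so the URL itself stays short
refparse = requests.post(refparser_url,
                         data={'text': body_chunk, 'key': biblia_apikey},
                         headers={'X-HTTP-Method-Override': 'GET'})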

If it's acceptable in your case, you can compress the HTML text by unescaping the HTML character references and applying the NFC Unicode normalization form, which composes some sequences of codepoints:

import unicodedata
from HTMLParser import HTMLParser

unicode_text = unicodedata.normalize('NFC', HTMLParser().unescape(unicode_text))
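
A quick REPL illustration of the saving, using a fragment from the question body (&quot; is six ASCII characters that unescape to a single quotation mark):

>>> from HTMLParser import HTMLParser
>>> import unicodedata
>>> raw = u'&quot;rest&quot;'
>>> cleaned = unicodedata.normalize('NFC', HTMLParser().unescape(raw))
>>> len(raw), len(cleaned)
(16, 6)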
