UnicodeEncodeError when trying to add text to a new element in XML

Posted 2024-03-29 07:43:10


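I have been struggling with this problem for the past few hours, reading through the documentation and existing forum posts without finding a fix. So I figured I would post here as a last-ditch effort before giving up.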

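Basically, the task at hand is to open a file (actually many files) containing text that I want to put into new XML elements. The text files were themselves created with Python scripts, which handled UTF-16 and UTF-8 just fine. But whenever I try to read the text content into memory to put it into a new XML tag (as opposed to writing it out to a new text file, as before), the following error is thrown: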

"Traceback (most recent call last):
  File "K:\Users\Johnny\My Documents\PythonSandbox\websiteMigrationScripts\createXmlFile.py", line 87, in <module>
    root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src\lxml\lxml.etree.c:55337)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src\lxml\lxml.etree.c:24657)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src\lxml\lxml.etree.c:24506)
  File "src/lxml/apihelpers.pxi", line 1431, in lxml.etree._utf8 (src\lxml\lxml.etree.c:32293)
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udc92' in position 1862: surrogates not allowed"
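
For reference, the last line of the traceback can be reproduced without lxml. The sketch below assumes a single stray byte 0x92 in otherwise ASCII input; the byte string is made up for illustration and is not taken from the actual files. Decoding with `errors="surrogateescape"` keeps the undecodable byte as the lone surrogate `'\udc92'`, and a strict UTF-8 encode of that string then fails with the same "surrogates not allowed" message:

```python
# Hypothetical input: ASCII text with one byte (0x92) that is not valid UTF-8.
raw = b"O\x92Shaughnessy"

# surrogateescape does not reject the bad byte; it stores it as U+DC92.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding back to strict UTF-8 (as lxml does when setting .text) fails.
try:
    text.encode("utf-8")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(repr(text[1]))
print(message)
```

So the error would be raised at the point where the text is encoded, not where the file is read.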

My script looks like this:

from bs4 import BeautifulSoup
import os, codecs
import imageFilesSub
import utf16FilesList
import openpyxl, lxml
from openpyxl.utils import get_column_letter, column_index_from_string

# First get the list of files to parse
filesDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\Website\tier 2 pages\tier 3 pages\tier 4 pages'
filesInDir = os.listdir(filesDir)
filesOutDir = r'.\blogsToParse'
filesToParse = []
excludedTopics = ('travel-blog', 'accommodations', 'best-time-to-visit', 'activities', 'how-to-get-there', 'planning-and-preparing', 'restaurants', 'which-side', 'books-and-maps')
for file in filesInDir:
    # keep only template pages whose names contain none of the excluded topics
    if file.endswith('-template.html') and not any(topic in file for topic in excludedTopics):
        filesToParse.append(file)

# Then get a list of (unique) slugs that represent a unique row entry in the WoW Database
wowDatabaseDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\WordPressSite'
wowSpreadsheet = r'WoW Database for WP.xlsm'
wb = openpyxl.load_workbook(wowDatabaseDir + '\\' + wowSpreadsheet, data_only=True)
sheet = wb.active

# the following loop returns to maxRow the highest non-empty row
maxRow = 1  # openpyxl indexes from 1 not 0
for i in range(1, sheet.max_row): 
    if sheet.cell(row=i, column=33).value is None:
        pass
    else:
        maxRow = maxRow + 1

# now make a list containing the directory names of the writeups
writeupDirs = []
slugList = []
for i in range(3, maxRow + 1):
    writeupDirs.append(sheet.cell(row=i, column=18).value)
    slugList.append(sheet.cell(row=i, column=33).value)

from lxml import etree

xmlFile = 'WoW Database for WP 2017-01-01.xml'
data_file = wowDatabaseDir + '\\' + xmlFile
tree = etree.ElementTree(file=data_file)
root = tree.getroot()

k = 0
for element in root:
    try:
        element.attrib[root[k][0].tag] = root[k][0].text  # this puts Entry_No as an attribute of Row
        element.attrib[root[k][1].tag] = root[k][1].text  # this puts Waterfall Name as an attribute of Row
        root[k].append(etree.Element("Introduction_Body"))
        root[k].append(etree.Element("Directions_Body"))

        # need to go through some hoops and hurdles just to find the index of the desired tag (there must be a better way)
        children = []
        for child in root[k]:
            children.append(child.tag)
        fileDirIndex = children.index('File_directory')
        postSlugIndex = children.index('Post_Slug')
        introFilePtrIndex = children.index('Introduction_File_Ptr')
        introBodyIndex = children.index('Introduction_Body')
        introFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][introFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(introFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(introFile, 'r', encoding="utf-8", errors="surrogateescape")
        introBuffer = []
        for line in inFile:
            introBuffer.append(line)
        root[k][introBodyIndex].text = '<![CDATA[' + "".join(introBuffer) + ']]>'
        inFile.close()

        directionsFilePtrIndex = children.index('Directions_File_Ptr')
        directionsBodyIndex = children.index('Directions_Body')
        directionsFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][directionsFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(directionsFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(directionsFile, 'r', encoding="utf-8", errors="surrogateescape")
        directionsBuffer = []
        for line in inFile:
            directionsBuffer.append(line)
        root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
        inFile.close()
    except IndexError:
        pass
    k = k+1
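
On the "there must be a better way" comment in the script: ElementTree-style elements have a `find` method that returns the first child with a given tag, so the index bookkeeping may not be needed. Below is a minimal sketch using the standard library's `xml.etree.ElementTree` (lxml's `etree` elements expose the same `find` call). The tiny `Row` document and its values are hypothetical stand-ins for the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of one Row; tag names follow the script, values are made up.
row = ET.fromstring(
    "<Row>"
    "<Entry_No>35</Entry_No>"
    "<File_directory>example-directory</File_directory>"
    "<Directions_File_Ptr>example-directions.txt</Directions_File_Ptr>"
    "</Row>"
)

# Create the new child and keep a direct reference to it.
directions_body = ET.SubElement(row, "Directions_Body")

# Look children up by tag instead of by position.
file_dir = row.find("File_directory").text
directions_ptr = row.find("Directions_File_Ptr").text

directions_body.text = "text read from " + file_dir + "/" + directions_ptr
found = row.find("Directions_Body")
missing = row.find("No_Such_Tag")  # find returns None when the tag is absent
```

Because `find` returns `None` for an absent tag, a missing column can be checked explicitly. Note also that `list.index` raises `ValueError`, not `IndexError`, when a tag is not in the list, so the `except IndexError` in the script would not catch that case.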

The text file in question (at least the first one that was flagged) looks like this:

<div class="ad-right">[adrotate banner="17"]</div>

Wapama Falls sits in Hetch Hetchy, which is in the remote northwest corner of Yosemite National Park.  We generally drive up to Yosemite Valley from Los Angeles before getting up to Hetch Hetchy so we'll describe this route first.  It typically takes us about 6 hours to make the drive from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to Yosemite Valley.  We normally go from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20013079&aid=825833" target="_blank">Fresno</a> via the I-5 and Hwy 99, then through <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014886&aid=825833" target="_blank">Oakhurst</a> and <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20016735&aid=825833" target="_blank">Wawona</a> via the Hwy 41.  Once in Yosemite Valley, we'd drive west towards the Big Oak Flat Road where the Hwy 120 and Hwy 140 junction.  Then, we'd drive uphill on the Hwy 140 towards the Big Oak Flat Entrance (the Northwest Entrance), where we'd leave the park. 

From the Big Oak Flat Entrance on the Big Oak Flat Road (Route 120), we'd shortly have to turn right at the signed turnoff for <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a> and the Evergreen Road.  Then, we'd follow Evergreen Road for 7.5 miles to its junction with Hetch Hetchy Road in <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Turning right onto Hetch Hetchy Road, we'd follow it to the parking lot by the O’Shaughnessy Dam after about seven miles.  On the way, we'd have passed through another entrance fee station.  The two-lane road was a bit narrow in places so we had to drive slowly.  Eventually, we'd reach a car park next to the dam.  The drive from Yosemite Valley to the car park at the O'Shaugnessy Dam took us less than 90 minutes.

From <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015732&aid=825833" target="_blank">San Francisco</a>, we'd drive east towards <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015274&aid=825833" target="_blank">Pleasanton</a>, then continue east on the I-205 towards the Hwy 120 passing through <a rel="nofollow" href="    http://www.booking.com/searchresults.html?city=20013298&aid=825833" target="_blank">Groveland</a> and eventually through the town of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Once we were east of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>, we'd follow the road to the O'Shaugnessy Dam as described above.  Overall, this drive would take around 4 hours without traffic.

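This happens to be the 35th record being processed, so the script successfully parsed the 34 previous records and populated them into the modified XML. Those files are similar text files (basically text with HTML markup and some WordPress shortcodes).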

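So I'm stumped. I don't understand how this 35th file differs from the previous 34. I also don't know how this illegal character got in there in the first place, or how to get past it.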
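
One way to see how record 35 differs, sketched under the assumption that the offending file contains a Windows-1252 byte: 0x92 is the curly apostrophe ’ in that code page, and the sample above contains one in "O’Shaughnessy". With `errors="surrogateescape"`, each undecodable byte 0xNN is kept as the lone surrogate U+DCNN, so the placeholders can be located and mapped back to the original bytes. The helper names and the cp1252 fallback are illustrative choices, not something confirmed by the files themselves:

```python
def find_escaped_bytes(text):
    """Return (position, byte) pairs for surrogateescape placeholders in text."""
    return [
        (pos, ord(ch) - 0xDC00)
        for pos, ch in enumerate(text)
        if 0xDC80 <= ord(ch) <= 0xDCFF
    ]


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly, in order, and return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: the dam's name with a Windows-1252 curly apostrophe (byte 0x92).
raw = "the O\u2019Shaughnessy Dam".encode("cp1252")

# Diagnose: where are the escaped bytes, and which bytes were they?
escaped = raw.decode("utf-8", errors="surrogateescape")
hits = find_escaped_bytes(escaped)

# Work around: decode strictly with a fallback so no lone surrogates remain.
clean = decode_with_fallback(raw)
encoded = clean.encode("utf-8")  # succeeds because clean has no lone surrogates
```

Reading a file in binary mode (`open(path, 'rb').read()`) and passing the bytes to a helper like this would avoid handing lone surrogates to lxml. Whether cp1252 is the right fallback depends on how the files were actually saved.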

Any help from the community would be greatly appreciated.

Thanks!

