使用beautifulsoup4解析Python-xml并将输出写入mysql数据库 - Unicode错误

Question

我正在尝试使用beautifulsoup4来解析一个xml文件。
开发环境：LICLIPSE
Python版本：2.7
xml编码：utf-8

下面是我用来解析xml文件并将提取的信息写入本地mysql数据库的代码。

from bs4 import BeautifulSoup
import pymysql
import os, os.path

#strips apostrophes from the text and then just adds them at the beginning and end for the query
def apostro(text):

    text= text.replace("'","")
    text= text.replace(",","")
    text = "'"+text+"'"
    return text

#sets up the MYSQL connection
conn = pymysql.connect(host='127.0.0.1',  user='xxxx', passwd='xxxx', db='mysql', port= 3306 )
cur = conn.cursor()

#drop all of the previous values from the database
cur.execute("DELETE FROM db WHERE title is not null")
conn.commit()

#loop through all of the files
for root, _, files in os.walk("C:/usc/xml"):
    for f in files:
        #j is a counter for how many sections we have processed
        j=0
        #fullpath is the location of the file we're parsing
        fullpath = os.path.join(root, f)
        print(fullpath)

        #open file using BeautifulSoup
        soup = BeautifulSoup(open(""+fullpath+""), 'xml')
        sec = soup.find_all("section", {"style" : "-uslm-lc:I80"})
        t = soup.main.title
        t_num = t.num['value']

        #if not clauses are needed in case there is a blank, otherwise an error is thrown
        if not t.heading.text:
            t_head = ''
        else:
            t_head = t.heading.text.encode('ascii', 'ignore').encode("UTF-8")

        for element in sec:
            if not element.num['value']:
                section = ''
            else:
                section = element.num['value'].encode('ascii', 'ignore').encode("UTF-8")

            if not element.heading:
                s_head = ''
            else:
                s_head = element.heading.text.encode('ascii', 'ignore').encode("UTF-8")

            if not element.text:
                s_text = ''
            else:
                s_text = element.text.encode('ascii', 'ignore').encode("UTF-8")

            #inserttest is the sql command that 'cur' executes. counter is printed every time a section is written to let me know the program is still alive
            inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (" + t_num + "," + apostro(t_head) + "," + apostro(section) + "," + apostro(s_head) + "," + apostro(s_text) +")"
            j=j+1
            cur.execute( inserttest)
            conn.commit()
            print(fullpath + " " +str(j))

conn.commit()
cur.close()
conn.close()

一切进展顺利，直到我发现程序忽略了章节编号中的连字符'-'，这导致整个活动出错。

我知道在编码语句中使用了'ignore'，但是连字符'-'在ascii中是一个合法字符，对吧？难道不应该把这个字符写入数据库，而不是忽略它吗？

我在StackOverflow和其他地方做了很多阅读。

我尝试在soup语句中加入from_encoding="utf-8"，在encode()语句中使用'xmlrefreplace'以及其他方法，但结果是下面的输出：它把这个aâ€“（某个特殊的unicode字符）写入数据库，而不是连字符'-'。

示例输出：

数据量很大，我担心程序可能还会忽略其他像-这样的字符。如果它忽略t_head、s_head和s_text字段中的特殊字符没关系，因为它们是文本，但章节列就不行。

如果能帮我解决这个问题，我将非常感激。

数据提取编码问题 xml解析 MySQL数据库数据库插入字符处理 unicode错误 beautifulsoup4

使用beautifulsoup4解析Python-xml并将输出写入mysql数据库 - Unicode错误

1 个回答

撰写回答