使用beautifulsoup4解析Python-xml并将输出写入mysql数据库 - Unicode错误
我正在尝试使用beautifulsoup4来解析一个xml文件。
开发环境:LICLIPSE
Python版本:2.7
xml编码:utf-8
示例xml文件:http://pastebin.com/RhjvyKDN
下面是我用来解析xml文件并将提取的信息写入本地mysql数据库的代码。
from bs4 import BeautifulSoup
import pymysql
import os, os.path
#strips apostrophes from the text and then just adds them at the beginning and end for the query
def apostro(text):
text= text.replace("'","")
text= text.replace(",","")
text = "'"+text+"'"
return text
#sets up the MYSQL connection
conn = pymysql.connect(host='127.0.0.1', user='xxxx', passwd='xxxx', db='mysql', port= 3306 )
cur = conn.cursor()
#drop all of the previous values from the database
cur.execute("DELETE FROM db WHERE title is not null")
conn.commit()
#loop through all of the files
for root, _, files in os.walk("C:/usc/xml"):
for f in files:
#j is a counter for how many sections we have processed
j=0
#fullpath is the location of the file we're parsing
fullpath = os.path.join(root, f)
print(fullpath)
#open file using BeautifulSoup
soup = BeautifulSoup(open(""+fullpath+""), 'xml')
sec = soup.find_all("section", {"style" : "-uslm-lc:I80"})
t = soup.main.title
t_num = t.num['value']
#if not clauses are needed in case there is a blank, otherwise an error is thrown
if not t.heading.text:
t_head = ''
else:
t_head = t.heading.text.encode('ascii', 'ignore').encode("UTF-8")
for element in sec:
if not element.num['value']:
section = ''
else:
section = element.num['value'].encode('ascii', 'ignore').encode("UTF-8")
if not element.heading:
s_head = ''
else:
s_head = element.heading.text.encode('ascii', 'ignore').encode("UTF-8")
if not element.text:
s_text = ''
else:
s_text = element.text.encode('ascii', 'ignore').encode("UTF-8")
#inserttest is the sql command that 'cur' executes. counter is printed every time a section is written to let me know the program is still alive
inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (" + t_num + "," + apostro(t_head) + "," + apostro(section) + "," + apostro(s_head) + "," + apostro(s_text) +")"
j=j+1
cur.execute( inserttest)
conn.commit()
print(fullpath + " " +str(j))
conn.commit()
cur.close()
conn.close()
一切进展顺利,直到我发现程序忽略了章节编号中的连字符'-'
,这导致整个活动出错。
我知道在编码语句中使用了'ignore'
,但是连字符'-'
在ascii中是一个合法字符,对吧?难道不应该把这个字符写入数据库,而不是忽略它吗?
我在StackOverflow和其他地方做了很多阅读。
我尝试在soup语句中加入from_encoding="utf-8"
,在encode()语句中使用'xmlrefreplace'
以及其他方法,但结果是下面的输出:它把这个a–
(某个特殊的unicode字符)写入数据库,而不是连字符'-'
。
示例输出:
数据量很大,我担心程序可能还会忽略其他像-
这样的字符。如果它忽略t_head
、s_head
和s_text
字段中的特殊字符没关系,因为它们是文本,但章节列就不行。
如果能帮我解决这个问题,我将非常感激。
1 个回答
0
不要进行编码,MySQL库完全可以直接将Unicode文本插入到数据库中。使用SQL参数,而不是字符串插值,并且在连接数据库时指定字符集:
conn = pymysql.connect(host='127.0.0.1', user='xxxx', passwd='xxxx',
db='mysql', port=3306,
charset='utf8')
不要进行编码:
t_head = t.heading.text or ''
for element in sec:
if not element.num['value']:
section = ''
else:
section = element.num.get('value', '')
s_head = element.heading.text or ''
s_text = element.text or ''
inserttest = "insert into deadlaws.usc_new (title, t_head, section, s_head, s_text) values (?, ?, ?, ?)"
cur.execute(inserttest, (t_num, t_head, section, s_head, s_text))