将RSS数据插入Mysql数据库过程中出现UnicodeEncode错误

2024-04-16 17:29:39 发布

您现在位置:Python中文网/ 问答频道 /正文

我目前正在从事个人项目,使新闻聚合。 为此,我从一些使用Pythonfeedparserlib的站点收集了一些关于文章的源代码。下面的代码是我的模块digitaltrends.py的函数。你知道吗

import feedparser
import requests
import re
import html

def cleanHtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr,'',raw_html)
    return cleantext

def to_utf(string):
    return string.encode('utf-8',errors='ignore').decode('utf-8')

def get_feed ():

    feed = 'https://www.digitaltrends.com/feed/'

    parsed_example = feedparser.parse (feed)
    press = to_utf(parsed_example['feed']['title'])
    articleList = []

    for items in parsed_example['entries'] :

        article = {}
        article['press'] = press
        article['link'] = to_utf(items.link)

        article['pubdate'] = to_utf(items.published)
        article['author'] =to_utf(items.authors[0]['name'])

        article['title'] = to_utf(items.title)
        article['summary'] = to_utf( html.unescape(cleanHtml(items.summary)) )


        if 'thumbnail' in items :
            article['thumbnail'] = to_utf(items['thumbnail'])

        else:
            article['thumbnail'] = None

        articleList.append(article)

    return articleList

之后,我通过发送以下查询字符串在自己的数据库中创建了一个表:

# i made it roughly for my first test    
sql = '''CREATE TABLE articles (
            ID INT NOT NULL auto_increment,
            Press VARCHAR(255) NOT NULL,
            Title VARCHAR(255) NOT NULL,
            Author VARCHAR(255),
            Pubdate TEXT NOT NULL,
            Summary TEXT,
            Link VARCHAR(3000) NOT NULL,
            Thumbnail VARCHAR(3000),
            PRIMARY KEY (ID)
    )'''

最后,我尝试将数据插入到表中。你知道吗

import digitaltrends
import MySQLdb as mysql
article_list = digitaltrends.get_feed()
''''' omitting connecting code '''''
sql = """INSERT INTO articles (Press,Title,Author,Pubdate,Summary,Link,Thumbnail) VALUES ("%s","%s","%s","%s","%s","%s","%s")"""

for each_article in article_list:

    data = (each_article['press'],
            each_article['title'],
            each_article['author'],
            each_article['pubdate'],
            each_article['summary'],
            each_article['link'],
            each_article['thumbnail'])

    cursor.execute(sql,data)
    conn.commit()

执行此代码时发生错误。你知道吗

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-24-4d8567a1f7ad> in <module>()
     11             each_article['thumbnail'])
     12 
---> 13     cursor.execute(sql,data)
     14     conn.commit()
     15 #conn.rollback()

~\Anaconda3\lib\site-packages\MySQLdb\cursors.py in execute(self, query, args)
    232                 args = dict((key, db.literal(item)) for key, item in args.items())
    233             else:
--> 234                 args = tuple(map(db.literal, args))
    235             if not PY2 and isinstance(query, (bytes, bytearray)):
    236                 query = query.decode(db.encoding)

~\Anaconda3\lib\site-packages\MySQLdb\connections.py in literal(self, o)
    316             s = self._tuple_literal(o)
    317         else:
--> 318             s = self.escape(o, self.encoders)
    319         # Python 3(~3.4) doesn't support % operation for bytes object.
    320         # We should decode it before using %.

~\Anaconda3\lib\site-packages\MySQLdb\connections.py in unicode_literal(u, dummy)
    223             # unicode_literal() is called for arbitrary object.
    224             def unicode_literal(u, dummy=None):
--> 225                 return db.string_literal(str(u).encode(db.encoding))
    226 
    227         def bytes_literal(obj, dummy=None):

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2018' in position 18: ordinal not in range(256)

我没有将编码设置为拉丁-1。我的数据库和表也设置为UTF-8编码。我试着自己去修正它,但我不知道为什么我会遇到拉丁语-1的错误。你知道吗

已解决:

我的初始连接代码:

conn = mysql.connect(host = 'localhost',
                        user = me,
                        password = pw,
                        db = bd,
                     )

在我将charset='utf8'参数添加到connect函数之后,它工作得很好。你知道吗


Tags: toinimportfordbdeffeedarticle