Python: ValueError: invalid literal for int() with base 10: ''

Asked 2025-04-17 02:25

I have a text file with content like this:

70154::308933::3
UserId::ProductId::Score

I wrote this program to read it:

(Sorry, the indentation is a bit messy.)

def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}

    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        # del innerDict[0:len(innerDict)]

        for line in myFile:
            c += 1
            # line = str(line)
            n = len(line)
            # print 'n: ', n
            if n is not 1:
                # if c%100 == 0: print "%d: " % c, " entries read so far"
                # words = line.replace(' ', '_')
                words = line.replace('::', ' ')

                words = words.strip().split()

                # print 'userid: ', words[0]
                userId = int(words[0])  # i get error here
                movieId = int(words[1])
                rating = float(words[2])
                print "userId: ", userId, " productId: ", movieId, " :rating: ", rating
                # print words
                # words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId, {})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno, strerror):
        print "I/O error({0}) :{1} ".format(errno, strerror)

    finally:
        myFile.close()
    print "total ratings read from file", fileName, " :%d " % c
    return dataDict

But I get this error:

ValueError: invalid literal for int() with base 10: ''

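Interestingly, the same code works fine when it reads data in the same format from other files.
Actually, while posting this question I noticed something odd...
In the entry 70154::308933::3 there seems to be a space after every character: a space after the 7, after the 0, after the 1, after the 5, after the 4, after the ::, and so on up to the final 3...
But the text file itself looks fine... :( The spaces only show up when I copy and paste.
Anyway, does anyone know what is going on?
Thanks!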


2 Answers

Debugging 101: just change this line:

words = words.strip().split()

to:

words = words.strip().split()
print words

and see what gets printed.

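A few more points. If your file contains the literal line UserId::... and you try to process it, Python is not going to like being asked to convert that to an integer.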
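
One way to guard against that header row is to treat any line that does not parse as "not data" and skip it. A minimal sketch, written for Python 3 (the code in the question is Python 2); the helper name `parse_rating_line` is made up for illustration and is not from the question:

```python
def parse_rating_line(line):
    """Return (user_id, product_id, score), or None if the line is not a data row."""
    fields = line.strip().split("::")
    if len(fields) != 3:
        return None
    try:
        return int(fields[0]), int(fields[1]), float(fields[2])
    except ValueError:
        # Non-numeric fields, e.g. the literal header "UserId::ProductId::Score".
        return None


print(parse_rating_line("70154::308933::3"))
print(parse_rating_line("UserId::ProductId::Score"))
```

Returning None instead of raising keeps the reading loop simple: the caller can just skip anything that comes back as None.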

There is also this slightly odd line:

if n is not 1:

which I would probably write as:

if n != 1:
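
The reason for preferring `!=` is that `is` tests object identity while `==` and `!=` compare values. A small Python 3 sketch of the difference; whether two equal integers happen to be the same object is an interpreter implementation detail, so the identity result is deliberately not stated here:

```python
a = int("1000")
b = int("1000")

# Equality compares values, so this is reliably True.
print(a == b)

# Identity asks whether both names refer to the same object. The answer
# depends on the interpreter's internal caching, so code should not rely on it.
print(a is b)
```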

If what you mention in your comment is what you get:

['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']

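then I suggest you check your input file for binary (non-text) data. You should never see binary content like that if you are just reading text and stripping/splitting it.

And since you say there appear to be spaces between the digits, you should hex-dump the file to see what is really in it. It may, for example, be a UTF-16 Unicode file.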
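
If you do not have a hex-dump tool handy, a few lines of Python will do. This is only a sketch, written for Python 3; `hex_preview` is a made-up helper name, and the sample bytes stand in for what you would read from your own file:

```python
def hex_preview(data, width=16):
    """Format bytes as rows of 'offset  hex pairs'."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        rows.append("%04x  %s" % (offset, hex_part))
    return "\n".join(rows)


# Demo on in-memory bytes: the start of the sample row as UTF-16 little-endian with a BOM.
sample = b"\xff\xfe" + "70154::".encode("utf-16-le")
print(hex_preview(sample))
# prints: 0000  ff fe 37 00 30 00 31 00 35 00 34 00 3a 00 3a 00
```

For a real file you would pass something like `open(path, "rb").read(32)` instead of `sample`. A `00` after every digit is the telltale sign of UTF-16.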

Answered 2025-04-17 by Python大师


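What you are seeing as "spaces" are actually NUL characters ("\x00"). Your file is almost certainly encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it in Notepad and use "Save As", choosing "ANSI" rather than "Unicode" or "Unicode big endian". If, however, you need to process the file as it is, you will have to know or detect its encoding. To find out which encoding it is, do this: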

print repr(open("yourfile.txt", "rb").read(20))

Then compare the start of the output with the following:

>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>

You can make a good-enough encoding detector by inspecting the first two bytes:

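def detect_encoding(f2b):  # f2b: the first two bytes of the file
    if f2b in ("\xff\xfe", "\xfe\xff"):  # byte-order mark, either endianness
        return "UTF-16"
    elif f2b[1:] == "\x00":
        return "UTF-16LE"
    elif f2b[:1] == "\x00":
        return "UTF-16BE"
    else:  # cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods
        return "cp1252"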

You can also avoid hard-coding the fallback encoding:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

Your line-reading code would then look something like this:

rawbytes = open(fileName, "rb").read()  # fileName: the path, as in the question's function
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    pass  # whatever per-line processing you need

Oh, and the lines you get will be unicode objects... if that causes you problems, ask another question.
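
For anyone on Python 3, the same idea can be sketched end to end as follows. This is an illustration under assumptions, not a drop-in replacement: `read_ratings` is a made-up name, the detection heuristic is the two-byte check described above, and the demo runs on in-memory bytes rather than a real file.

```python
import locale


def detect_encoding(first_two_bytes):
    """Guess an encoding from the first two bytes of the file."""
    if first_two_bytes in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"  # BOM present; the codec works out the byte order
    if first_two_bytes[1:2] == b"\x00":
        return "utf-16-le"
    if first_two_bytes[0:1] == b"\x00":
        return "utf-16-be"
    return locale.getpreferredencoding()  # fallback: the platform default


def read_ratings(raw_bytes):
    """Decode raw file bytes and build {user_id: {product_id: score}}."""
    text = raw_bytes.decode(detect_encoding(raw_bytes[:2]))
    ratings = {}
    for line in text.splitlines():
        fields = line.strip().split("::")
        if len(fields) != 3:
            continue  # blank or malformed line
        try:
            user_id, product_id = int(fields[0]), int(fields[1])
            score = float(fields[2])
        except ValueError:
            continue  # e.g. the "UserId::ProductId::Score" header
        ratings.setdefault(user_id, {})[product_id] = score
    return ratings


# Demo: bytes standing in for open(path, "rb").read() on a UTF-16 file.
sample = "70154::308933::3\nUserId::ProductId::Score\n".encode("utf-16")
print(read_ratings(sample))
```

In Python 3 the decoded lines are ordinary `str` objects, so the unicode-versus-bytes distinction mentioned above only matters at the point where the raw bytes are decoded.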

Answered 2025-04-17 by Python大师
