Python: reading a file from a URL

Posted 2024-04-29 04:33:46


What is the correct way to read a text file from the internet? For example, the text file at https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt

The code below works, but every word is printed with an extra b' prefix:

from urllib.request import urlopen
#url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english.txt'
url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'
#data = urlopen(url)
#print('H w')

# it's a file like object and works just like a file
l = set()
data = urlopen(url)
for line in data:  # files are iterable
    word = line.strip()
    print(word)
    l.add(word)

print(l)

2 answers

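urlopen returns the response body as bytes, so each line is a bytes object, and printing it shows the b'...' representation. You have to decode each bytes object into a Unicode string, which you can do with the decode('utf-8') method. The code looks like this: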

from urllib.request import urlopen
url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'

l = set()
data = urlopen(url)
for line in data:  # files are iterable
    word = line.strip().decode('utf-8') # decode the line into unicode
    print(word)
    l.add(word)

print(l)
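
As a variation on the answer above, here is a minimal sketch that decodes the whole stream with io.TextIOWrapper rather than calling decode on each line. The helper name read_words and the in-memory fake_response are assumptions made for illustration; fake_response stands in for the object returned by urlopen(url) so the example runs without network access. Wrapping the real response the same way should behave similarly, since it is also a binary file-like object.

```python
import io


def read_words(binary_stream, encoding="utf-8"):
    """Return the set of stripped, non-empty lines from a binary file-like object."""
    # TextIOWrapper decodes the underlying bytes into str as lines are read.
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    return {line.strip() for line in text if line.strip()}


# Hypothetical stand-in for urlopen(url): an in-memory binary stream.
fake_response = io.BytesIO(b"a\nability\nable\n")

print(b"able")                  # a bytes object prints with the b'...' prefix
print(b"able".decode("utf-8"))  # the decoded str prints as plain text

words = read_words(fake_response)
print(sorted(words))
```

Passing the encoding once to the wrapper keeps the loop body free of decode calls, at the cost of assuming the whole file uses a single encoding.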

This is simple with pandas. Just run:

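import pandas as pd
url = 'https://gist.githubusercontent.com/deekayen/4148741/raw/01c6252ccc5b5fb307c1bb899c95989a8a284616/1-1000.txt'
# header=None keeps the first word as data instead of using it as the column name
df = pd.read_csv(url, header=None, names=['word'])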

And you're all set :)
