Consider this simple example:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame({'link': ['https://en.wikipedia.org/wiki/World%27s_funniest_joke',
                            'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World']})

def puller(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

df['parsed'] = df.link.apply(lambda x: puller(x))
This correctly pulls the HTML of the Wikipedia pages into the dataframe. I need to save that dataframe for later use. Unfortunately, when I export it to JSON, the parsed column disappears.
df
Out[32]:
link \
0 https://en.wikipedia.org/wiki/World%27s_funniest_joke
1 https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World
parsed
0 [html, [[\n, <meta charset="utf-8"/>, \n, <title>World's funniest joke - Wikipedia</title>, \n, <script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X@pXWwpAIDEAAE2fgjIAAABR","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"World's_funniest_joke","wgTitle":"World's funniest joke","wgCurRevisionId":966811540,"wgRevisionId":966811540,"wgArticleId":647561,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["W...
1 [html, [[\n, <meta charset="utf-8"/>, \n, <title>The Funniest Joke in the World - Wikipedia</title>, \n, <script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"X@pjbgpAICMAADkWE2kAAACQ","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"The_Funniest_Joke_in_the_World","wgTitle":"The Funniest Joke in the World","wgCurRevisionId":996508425,"wgRevisionId":996508425,"wgArticleId":19164,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups...
However:
df.to_json('myjson.json')
df = pd.read_json('myjson.json')
df
Out[30]:
link parsed
0 https://en.wikipedia.org/wiki/World%27s_funniest_joke NaN
1 https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World NaN
What is the problem here? Alternatively, what is the best format for storing this dataframe? CSV tends to get mangled by all the odd HTML characters.
Thanks!
It looks like pandas doesn't know how to handle the character escaping correctly for JSON in this case. If you don't need to use pandas, you can do something like the following:
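A minimal sketch of such a pandas-free round trip. The inline HTML, file-free `json.dumps`/`json.loads` calls, and the `html.parser` backend are placeholders chosen to keep the example self-contained; in practice the markup would come from `requests.get(url).content` parsed with `html5lib` as in the question:

```python
import json

from bs4 import BeautifulSoup

# Stand-in for a fetched page; in practice this would be
# requests.get(url).content parsed with 'html5lib'.
html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# str() on a BeautifulSoup object returns the serialized markup,
# which the json module can escape and store without trouble.
record = {"link": "https://example.org", "parsed": str(soup)}
text = json.dumps(record)

# Later: load and re-parse on demand.
restored = json.loads(text)
soup2 = BeautifulSoup(restored["parsed"], "html.parser")
print(soup2.title.string)  # Demo
```

The key point is that only the plain string crosses the serialization boundary; the `BeautifulSoup` object is rebuilt from it whenever needed.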
Judging from your code, you are using a third-party library for the web scraping. In Python, convert the bs4.BeautifulSoup type to str before storing it, for example:

{"url": str(url),
 "uid": str(uniqueID),
 "pageContent": htmlContent.text,
 "date": finalDate}
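The same idea works within pandas: cast the parsed column to str before calling to_json, and the round trip succeeds. A sketch, with inline HTML and `html.parser` standing in for the fetched pages and `html5lib` from the question:

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML stands in for requests.get(...).content here.
df = pd.DataFrame({"link": ["https://example.org"],
                   "page": ["<html><body><p>hi</p></body></html>"]})

# Parse, then cast the soup back to a plain string so that
# pandas can serialize the column to JSON.
df["parsed"] = df.page.apply(lambda h: str(BeautifulSoup(h, "html.parser")))

out = df.to_json()
df2 = pd.read_json(io.StringIO(out))

# The parsed column survives the round trip as a string; re-parse
# with BeautifulSoup whenever the soup object is needed again.
soup = BeautifulSoup(df2["parsed"].iloc[0], "html.parser")
print(soup.p.string)  # hi
```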