pd.read_html does not work for multiple URLs. IndexError: list index out of range

IndexError                                Traceback (most recent call last)
<ipython-input-13-784175815486> in <module>
----> 1 df = pd.read_html(url, flavor = 'lxml')

C:\Python\Python38\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1088     )
   1089     _validate_header_arg(header)
-> 1090     return _parse(
   1091         flavor=flavor,
   1092         io=io,

C:\Python\Python38\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915     for table in tables:
    916         try:
--> 917             ret.append(_data_to_frame(data=table, **kwargs))
    918         except EmptyDataError:  # empty table
    919             continue

C:\Python\Python38\lib\site-packages\pandas\io\html.py in _data_to_frame(**kwargs)
    791     # fill out elements of body that are "ragged"
    792     _expand_elements(body)
--> 793     tp = TextParser(body, header=header, **kwargs)
    794     df = tp.read()
    795     return df

C:\Python\Python38\lib\site-packages\pandas\io\parsers.py in TextParser(*args, **kwds)
   2221     """
   2222     kwds["engine"] = "python"
-> 2223     return TextFileReader(*args, **kwds)
   2224
   2225

C:\Python\Python38\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894
--> 895         self._make_engine(self.engine)
    896
    897     def close(self):

C:\Python\Python38\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1145                     ' "python-fwf")'.format(engine=engine)
   1146                 )
-> 1147             self._engine = klass(self.f, **self.options)
   1148
   1149     def _failover_to_python(self):

C:\Python\Python38\lib\site-packages\pandas\io\parsers.py in __init__(self, f, **kwds)
   2308             self.num_original_columns,
   2309             self.unnamed_cols,
-> 2310         ) = self._infer_columns()
   2311
   2312         # Now self.columns has the set of columns that we will process.

C:\Python\Python38\lib\site-packages\pandas\io\parsers.py in _infer_columns(self)
   2691                     columns = [names]
   2692                 else:
-> 2693                     columns = self._handle_usecols(columns, columns[0])
   2694         else:
   2695             try:

IndexError: list index out of range

1 answer

User

#1 · Posted 2024-05-14 03:17:46

This may not be a complete answer, but I only registered earlier today and am not yet allowed to comment, so apologies in advance.

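It looks like the problem is with the tables on the page. Several of the rows (tr) contain only empty header cells (th) and no data cells (td).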

If you look at: https://www.sec.gov/Archives/edgar/data/0001119774/000117891309002587/zk97422.htm

This is the first table it finds:

<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tbody><tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <td width="50%" align="LEFT"><font face="Times New Roman" size="2">PROSPECTUS SUPPLEMENT</font></td>
     <td width="50%" align="RIGHT"><font face="Times New Roman" size="2">Filed Pursuant to Rule 424(b)(5)&nbsp;</font></td></tr>
<tr valign="Bottom">
     <td align="LEFT"><font face="Times New Roman" size="2">(To Prospectus dated August 11, 2009)</font></td>
     <td align="RIGHT"><font face="Times New Roman" size="2">Registration No. 333-161241&nbsp;</font></td></tr>
</tbody></table>

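The first 5 rows have no td cells and no header text. I saved that table to a local file and ran read_html on the file, which gave me the same error. If I delete those first 5 rows, which contain only empty headers, it works: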

[                                       0                                 1
0                  PROSPECTUS SUPPLEMENT  Filed Pursuant to Rule 424(b)(5)
1  (To Prospectus dated August 11, 2009)       Registration No. 333-161241]

I'm not used to working with pandas, so I'm not sure whether there is a way to force it to skip those empty tr elements.
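One possible workaround is to filter the rows before pandas ever sees them. The sketch below is not a pandas option; it uses only the standard-library html.parser module to keep the rows that contain at least one td cell and to drop the header-only rows. The names DataRowExtractor and extract_data_rows are made up for this illustration, and the sketch assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class DataRowExtractor(HTMLParser):
    """Collect <td> text, one list per <tr>; rows without any <td> are dropped.

    Hypothetical helper for illustration. It does not handle nested tables.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows that contained at least one <td>
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (including text inside <th>) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # &nbsp; arrives as "\xa0"; normalise it to a plain space.
            text = "".join(self._cell).replace("\xa0", " ").strip()
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header-only rows never gained a cell
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_data_rows(html):
    """Return the table's data rows as a list of lists of cell text."""
    parser = DataRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# A shortened copy of the table quoted above: one header-only row, one data row.
sample = """<table><tbody>
<tr valign="Bottom">
     <th><font face="Times New Roman" size="1"></font></th>
     <th><font face="Times New Roman" size="1"></font></th></tr>
<tr valign="Bottom">
     <td width="50%" align="LEFT"><font face="Times New Roman" size="2">PROSPECTUS SUPPLEMENT</font></td>
     <td width="50%" align="RIGHT"><font face="Times New Roman" size="2">Filed Pursuant to Rule 424(b)(5)&nbsp;</font></td></tr>
</tbody></table>"""

rows = extract_data_rows(sample)
print(rows)
```

The resulting list of rows can then be passed to pd.DataFrame(rows) to get a frame with the same default 0, 1 column labels as the output shown above.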

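I also found this question: pandas read_html clean up before or after read

Although that question is about a different problem, it might be better to try something like BeautifulSoup? pandas doesn't seem to handle this page very well. Also, based on this answer: pandas read_html clean up before or after read, the same holds for the table above: the HTML is very different from what is actually displayed on the page.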
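Since the question is about running over several URLs, one defensive pattern is to try the normal parser on each page and switch to a more tolerant one only when it raises the IndexError from the traceback. This is just a sketch of that pattern: read_tables_with_fallback, primary and fallback are hypothetical names, and the two parsers are passed in as plain callables so that nothing about the pandas or BeautifulSoup APIs is assumed here.

```python
def read_tables_with_fallback(sources, primary, fallback):
    """Parse every source, using `fallback` only where `primary` fails.

    sources  -- iterable of URLs or file paths
    primary  -- callable taking one source and returning its tables
    fallback -- callable with the same shape, used after an IndexError

    Returns (tables, failed): a dict mapping each source to its result,
    and the list of sources for which the primary parser raised.
    """
    tables = {}
    failed = []
    for source in sources:
        try:
            tables[source] = primary(source)
        except IndexError:
            # Same exception type as in the traceback above; other errors
            # (network problems, missing parser) are left to propagate.
            failed.append(source)
            tables[source] = fallback(source)
    return tables, failed
```

In practice primary could be something like lambda url: pd.read_html(url, flavor='lxml'), as in the question, and fallback a function that cleans the HTML first (for example with BeautifulSoup) before building the frames. The failed list makes it easy to inspect which pages needed the workaround.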
