如何将Html<script>中的标题和值保存为csv文件中的表格？问题的回答

如何将Html<script>中的标题和值保存为csv文件中的表格？

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

<h2>解决方案</h2> <p>自定义函数<code>process_scripts()</code>将生成您要查找的内容。我使用下面给出的虚拟数据（最后）。首先，我们检查代码是否符合预期，因此我们创建一个数据帧来查看输出</p> <blockquote> <p>You could also open <a href="https://colab.research.google.com/github/sugatoray/stackoverflow/blob/master/src/answers/Q_62111654/Q_62111654.ipynb" rel="nofollow noreferrer">this Colab Jupyter Notebook</a> and run it on Cloud for free. This will allow you to not worry about any installation or setup and simply focus on examining the solution itself.</p> </blockquote> <p><a href="https://colab.research.google.com/github/sugatoray/stackoverflow/blob/master/src/answers/Q_62111654/Q_62111654.ipynb" rel="nofollow noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></p> <h2>一,。处理单个脚本</h2> <pre class="lang-py prettyprint-override"><code>## Define CSV file-output folder-path OUTPUT_PATH = './output' ## Process scripts dfs = process_scripts(scripts = [s], output_path = OUTPUT_PATH, save_to_csv = False, verbose = 0) print(dfs[0].reset_index(drop=True)) </code></pre> <p><strong>输出</strong>：</p> <pre class="lang-sh prettyprint-override"><code> Name 2012.12 ... 2019.12 2020.03 0 ANA 1,114,919,783.60 ... 3,270,000,000.87 2,347,500,000.63 1 DEN 55,200,000.00 ... 125,100,000.00 85,350,000.00 2 SIS 4,425,000,000.00 ... 11,857,500,000.00 9,315,000,000.00 3 TRK 1,692,579,200.00 ... 4,375,000,000.00 3,587,500,000.00 [4 rows x 31 columns] </code></pre> <hr/> <h2>二,。正在处理所有脚本</h2> <p>您可以使用自定义定义函数<code>process_scripts()</code>处理所有脚本。代码如下所示</p> <pre class="lang-py prettyprint-override"><code>## Define CSV file-output folder-path OUTPUT_PATH = './output' ## Process scripts dfs = process_scripts(scripts, output_path = OUTPUT_PATH, save_to_csv = True, verbose = 0) ## To clear the output dir-contents #!rm -f $OUTPUT_PATH/* </code></pre> <p>我在GoogleColab上做了这件事，它按预期工作</p> <p><a href="https://i.stack.imgur.com/v0mzt.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/v0mzt.png" alt="enter image description here"/></a></p> <h2>三,。以操作系统不可知的方式创建路径</h2> <p>为基于windows或unix的系统创建路径可能非常不同。下面向您展示了一种实现这一点的方法，而不必担心将运行代码的操作系统。我在这里用了<a href="https://docs.python.org/3.7/library/os.html" rel="nofollow noreferrer">^{<cd3>} library</a>。不过，我建议你也看看<a href="https://docs.python.org/3.7/library/pathlib.html?highlight=pathlib#module-pathlib" rel="nofollow noreferrer">^{<cd4>} library</a></p> <pre class="lang-py prettyprint-override"><code># Define relative path for output-folder OUTPUT_PATH = './output' # Dynamically define absolute path pwd = os.getcwd() # present-working-directory OUTPUT_PATH = os.path.join(pwd, os.path.abspath(OUTPUT_PATH)) </code></pre> <hr/> <h2>四,。代码：自定义函数<code>process_scripts()</code></h2> <p>在这里，我们使用<code>regex</code>（正则表达式）库和<code>pandas</code>以表格格式组织数据，然后写入csv文件。<code>tqdm</code>库用于在处理多个脚本时为您提供一个漂亮的<em>progressbar</em>。请查看代码中的注释，了解如果不是从<code>jupyter notebook</code>运行它，该怎么办。<code>os</code>库用于路径操作和创建输出目录</p> <pre class="lang-py prettyprint-override"><code>#pip install -U pandas #pip install tqdm import pandas as pd import re # regex import os from tqdm.notebook import tqdm # Use the following line if not using a jupyter notebook # from tqdm import tqdm def process_scripts(scripts, output_path='./output', save_to_csv: bool=False, verbose: int=0): """Process all scripts and return a list of dataframes and optionally write each dataframe to a CSV file. Parameters scripts: list of scripts output_path (str): output-folder-path for csv files save_to_csv (bool): default is False verbose (int): prints output for verbose>0 Example - OUTPUT_PATH = './output' dfs = process_scripts(scripts, output_path = OUTPUT_PATH, save_to_csv = True, verbose = 0) ## To clear the output dir-contents #!rm -f $OUTPUT_PATH/* """ ## Define regex patterns and compile for speed pat_header = re.compile(r"theCols\[\d+\] = new Array\s*$[\'](\d{4}\.\d{1,2})[\'],\d+,\d+$") pat_line = re.compile(r"theRows\[\d+\] = new Array\s*$(.*)$.*") pat_code = re.compile("([A-Z]{3})") # Calculate zfill-digits zfill_digits = len(str(len(scripts))) print(f'Total scripts: {len(scripts)}') # Create output_path if not os.path.exists(output_path): os.makedirs(output_path) # Define a list of dataframes: # An accumulator of all scripts dfs = [] ## If you do not have tqdm installed, uncomment the # next line and comment out the following line. #for script_num, script in enumerate(scripts): for script_num, script in enumerate(tqdm(scripts, desc='Scripts Processed')): ## Extract: Headers, Rows # Rows : code (Name: single column), line-data (multi-column) headers = script.strip().split('\n', 0)[0] headers = ['Name'] + re.findall(pat_header, headers) lines = re.findall(pat_line, script) codes = [re.findall(pat_code, line)[0] for line in lines] # Clean data for each row lines_data = dict() for line, code in zip(lines, codes): line_data = line.replace("','", "|").split('|') line_data[-1] = line_data[-1].replace("'", "") line_data[0] = code lines_data.update({code: line_data.copy()}) if verbose>0: print('{}: {}'.format(script_num, codes)) ## Load data into a pandas-dataframe # and write to csv. df = pd.DataFrame(lines_data).T df.columns = headers dfs.append(df.copy()) # update list # Write to CSV if save_to_csv: num_label = str(script_num).zfill(zfill_digits) script_file_csv = f'Script_{num_label}.csv' script_path = os.path.join(output_path, script_file_csv) df.to_csv(script_path, index=False) return dfs </code></pre> <hr/> <h2>五,。虚拟数据</h2> <pre class="lang-py prettyprint-override"><code>## Dummy Data s = """ <script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array(); theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63'); theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00'); theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00'); theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00'); var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script> """ ## Make a dummy list of scripts scripts = [s for _ in range(10)] </code></pre>

如何将Html<script>中的标题和值保存为csv文件中的表格？

1 个回答

相关Python问题