<h2>解决方案</h2>
<p>自定义函数<code>process_scripts()</code>将生成您要查找的内容。我使用下面给出的虚拟数据(最后)。首先,我们检查代码是否符合预期,因此我们创建一个数据帧来查看输出</p>
<blockquote>
<p>You could also open <a href="https://colab.research.google.com/github/sugatoray/stackoverflow/blob/master/src/answers/Q_62111654/Q_62111654.ipynb" rel="nofollow noreferrer">this Colab Jupyter Notebook</a> and run it on Cloud for free. This will allow you to not worry about any installation or setup and simply focus on examining the solution itself.</p>
</blockquote>
<p><a href="https://colab.research.google.com/github/sugatoray/stackoverflow/blob/master/src/answers/Q_62111654/Q_62111654.ipynb" rel="nofollow noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></p>
<h2>一,。处理单个脚本</h2>
<pre class="lang-py prettyprint-override"><code>## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts = [s],
output_path = OUTPUT_PATH,
save_to_csv = False,
verbose = 0)
print(dfs[0].reset_index(drop=True))
</code></pre>
<p><strong>输出</strong>:</p>
<pre class="lang-sh prettyprint-override"><code> Name 2012.12 ... 2019.12 2020.03
0 ANA 1,114,919,783.60 ... 3,270,000,000.87 2,347,500,000.63
1 DEN 55,200,000.00 ... 125,100,000.00 85,350,000.00
2 SIS 4,425,000,000.00 ... 11,857,500,000.00 9,315,000,000.00
3 TRK 1,692,579,200.00 ... 4,375,000,000.00 3,587,500,000.00
[4 rows x 31 columns]
</code></pre>
<hr/>
<h2>二,。正在处理所有脚本</h2>
<p>您可以使用自定义定义函数<code>process_scripts()</code>处理所有脚本。代码如下所示</p>
<pre class="lang-py prettyprint-override"><code>## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts,
output_path = OUTPUT_PATH,
save_to_csv = True,
verbose = 0)
## To clear the output dir-contents
#!rm -f $OUTPUT_PATH/*
</code></pre>
<p>我在GoogleColab上做了这件事,它按预期工作</p>
<p><a href="https://i.stack.imgur.com/v0mzt.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/v0mzt.png" alt="enter image description here"/></a></p>
<h2>三,。以操作系统不可知的方式创建路径</h2>
<p>为基于windows或unix的系统创建路径可能非常不同。下面向您展示了一种实现这一点的方法,而不必担心将运行代码的操作系统。我在这里用了<a href="https://docs.python.org/3.7/library/os.html" rel="nofollow noreferrer">^{<cd3>} library</a>。不过,我建议你也看看<a href="https://docs.python.org/3.7/library/pathlib.html?highlight=pathlib#module-pathlib" rel="nofollow noreferrer">^{<cd4>} library</a></p>
<pre class="lang-py prettyprint-override"><code># Define relative path for output-folder
OUTPUT_PATH = './output'
# Dynamically define absolute path
pwd = os.getcwd() # present-working-directory
OUTPUT_PATH = os.path.join(pwd, os.path.abspath(OUTPUT_PATH))
</code></pre>
<hr/>
<h2>四,。代码:自定义函数<code>process_scripts()</code></h2>
<p>在这里,我们使用<code>regex</code>(正则表达式)库和<code>pandas</code>以表格格式组织数据,然后写入csv文件。<code>tqdm</code>库用于在处理多个脚本时为您提供一个漂亮的<em>progressbar</em>。请查看代码中的注释,了解如果不是从<code>jupyter notebook</code>运行它,该怎么办。<code>os</code>库用于路径操作和创建输出目录</p>
<pre class="lang-py prettyprint-override"><code>#pip install -U pandas
#pip install tqdm
import pandas as pd
import re # regex
import os
from tqdm.notebook import tqdm
# Use the following line if not using a jupyter notebook
# from tqdm import tqdm
def process_scripts(scripts,
output_path='./output',
save_to_csv: bool=False,
verbose: int=0):
"""Process all scripts and return a list of dataframes and
optionally write each dataframe to a CSV file.
Parameters
scripts: list of scripts
output_path (str): output-folder-path for csv files
save_to_csv (bool): default is False
verbose (int): prints output for verbose>0
Example
-
OUTPUT_PATH = './output'
dfs = process_scripts(scripts,
output_path = OUTPUT_PATH,
save_to_csv = True,
verbose = 0)
## To clear the output dir-contents
#!rm -f $OUTPUT_PATH/*
"""
## Define regex patterns and compile for speed
pat_header = re.compile(r"theCols\[\d+\] = new Array\s*\([\'](\d{4}\.\d{1,2})[\'],\d+,\d+\)")
pat_line = re.compile(r"theRows\[\d+\] = new Array\s*\((.*)\).*")
pat_code = re.compile("([A-Z]{3})")
# Calculate zfill-digits
zfill_digits = len(str(len(scripts)))
print(f'Total scripts: {len(scripts)}')
# Create output_path
if not os.path.exists(output_path):
os.makedirs(output_path)
# Define a list of dataframes:
# An accumulator of all scripts
dfs = []
## If you do not have tqdm installed, uncomment the
# next line and comment out the following line.
#for script_num, script in enumerate(scripts):
for script_num, script in enumerate(tqdm(scripts, desc='Scripts Processed')):
## Extract: Headers, Rows
# Rows : code (Name: single column), line-data (multi-column)
headers = script.strip().split('\n', 0)[0]
headers = ['Name'] + re.findall(pat_header, headers)
lines = re.findall(pat_line, script)
codes = [re.findall(pat_code, line)[0] for line in lines]
# Clean data for each row
lines_data = dict()
for line, code in zip(lines, codes):
line_data = line.replace("','", "|").split('|')
line_data[-1] = line_data[-1].replace("'", "")
line_data[0] = code
lines_data.update({code: line_data.copy()})
if verbose>0:
print('{}: {}'.format(script_num, codes))
## Load data into a pandas-dataframe
# and write to csv.
df = pd.DataFrame(lines_data).T
df.columns = headers
dfs.append(df.copy()) # update list
# Write to CSV
if save_to_csv:
num_label = str(script_num).zfill(zfill_digits)
script_file_csv = f'Script_{num_label}.csv'
script_path = os.path.join(output_path, script_file_csv)
df.to_csv(script_path, index=False)
return dfs
</code></pre>
<hr/>
<h2>五,。虚拟数据</h2>
<pre class="lang-py prettyprint-override"><code>## Dummy Data
s = """
<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>
"""
## Make a dummy list of scripts
scripts = [s for _ in range(10)]
</code></pre>