如何将Html<script>中的标题和值保存为csv文件中的表格？

2条回答

网友

1楼 · 编辑于 2024-05-23 17:54:27

解决方案

自定义函数process_scripts()将生成您要查找的内容。我使用下面给出的虚拟数据（最后）。首先，我们检查代码是否符合预期，因此我们创建一个数据帧来查看输出

You could also open this Colab Jupyter Notebook and run it on Cloud for free. This will allow you to not worry about any installation or setup and simply focus on examining the solution itself.

一,。处理单个脚本

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts = [s], 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = False, 
                        verbose = 0)

print(dfs[0].reset_index(drop=True))

输出：

  Name           2012.12  ...            2019.12           2020.03
0  ANA  1,114,919,783.60  ...   3,270,000,000.87  2,347,500,000.63
1  DEN     55,200,000.00  ...     125,100,000.00     85,350,000.00
2  SIS  4,425,000,000.00  ...  11,857,500,000.00  9,315,000,000.00
3  TRK  1,692,579,200.00  ...   4,375,000,000.00  3,587,500,000.00

[4 rows x 31 columns]

二,。正在处理所有脚本

您可以使用自定义定义函数process_scripts()处理所有脚本。代码如下所示

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts, 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = True, 
                        verbose = 0)

## To clear the output dir-contents
#!rm -f $OUTPUT_PATH/*

我在GoogleColab上做了这件事，它按预期工作

三,。以操作系统不可知的方式创建路径

为基于windows或unix的系统创建路径可能非常不同。下面向您展示了一种实现这一点的方法，而不必担心将运行代码的操作系统。我在这里用了^{} library。不过，我建议你也看看^{} library

# Define relative path for output-folder
OUTPUT_PATH = './output'

# Dynamically define absolute path
pwd = os.getcwd() # present-working-directory
OUTPUT_PATH = os.path.join(pwd, os.path.abspath(OUTPUT_PATH))

四,。代码：自定义函数`process_scripts()`

在这里，我们使用regex（正则表达式）库和pandas以表格格式组织数据，然后写入csv文件。tqdm库用于在处理多个脚本时为您提供一个漂亮的progressbar。请查看代码中的注释，了解如果不是从jupyter notebook运行它，该怎么办。os库用于路径操作和创建输出目录

#pip install -U pandas
#pip install tqdm
import pandas as pd
import re # regex
import os
from tqdm.notebook import tqdm
# Use the following line if not using a jupyter notebook
# from tqdm import tqdm

def process_scripts(scripts, 
                    output_path='./output', 
                    save_to_csv: bool=False, 
                    verbose: int=0):
    """Process all scripts and return a list of dataframes and 
    optionally write each dataframe to a CSV file.

    Parameters
         
        scripts: list of scripts
        output_path (str): output-folder-path for csv files
        save_to_csv (bool):  default is False
        verbose (int): prints output for verbose>0

    Example
       -
    OUTPUT_PATH = './output'
    dfs = process_scripts(scripts, 
                          output_path = OUTPUT_PATH, 
                          save_to_csv = True, 
                          verbose = 0)

    ## To clear the output dir-contents
    #!rm -f $OUTPUT_PATH/*
    """
    ## Define regex patterns and compile for speed
    pat_header = re.compile(r"theCols\[\d+\] = new Array\s*\([\'](\d{4}\.\d{1,2})[\'],\d+,\d+\)")
    pat_line = re.compile(r"theRows\[\d+\] = new Array\s*\((.*)\).*")
    pat_code = re.compile("([A-Z]{3})")

    # Calculate zfill-digits
    zfill_digits = len(str(len(scripts)))
    print(f'Total scripts: {len(scripts)}')

    # Create output_path
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # Define a list of dataframes:  
    #   An accumulator of all scripts
    dfs = []

    ## If you do not have tqdm installed, uncomment the 
    #  next line and comment out the following line.
    #for script_num, script in enumerate(scripts):
    for script_num, script in enumerate(tqdm(scripts, desc='Scripts Processed')):
        ## Extract: Headers, Rows
        #    Rows : code (Name: single column), line-data (multi-column)
        headers = script.strip().split('\n', 0)[0]
        headers = ['Name'] + re.findall(pat_header, headers)
        lines = re.findall(pat_line, script)
        codes = [re.findall(pat_code, line)[0] for line in lines]
        # Clean data for each row
        lines_data = dict()
        for line, code in zip(lines, codes):
            line_data = line.replace("','", "|").split('|')
            line_data[-1] = line_data[-1].replace("'", "")
            line_data[0] = code
            lines_data.update({code: line_data.copy()})
        if verbose>0:
            print('{}: {}'.format(script_num, codes))
        ## Load data into a pandas-dataframe
        #  and write to csv.
        df = pd.DataFrame(lines_data).T
        df.columns = headers
        dfs.append(df.copy()) # update list
        # Write to CSV
        if save_to_csv:
            num_label = str(script_num).zfill(zfill_digits)
            script_file_csv = f'Script_{num_label}.csv' 
            script_path = os.path.join(output_path, script_file_csv)   
            df.to_csv(script_path, index=False)
    return dfs

五,。虚拟数据

## Dummy Data
s = """
<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>
"""

## Make a dummy list of scripts
scripts = [s for _ in range(10)]

网友

2楼 · 编辑于 2024-05-23 17:54:27

根据您问题中提供的<script>，您可以执行类似于下面代码的操作，以获得每个名称ANA, DEN ..的日期列表：

for _ in range(1, len(aaa.split("<b>'"))-1):
  s = aaa.split("<b>'")[_].split("'")
  print(_)
  lst = []
  for i in s:
    if "</B>" in i:
      name = i.split('>')[1].split("<")[0]
      print("{} = ".format(name), end="")
    if any(j.isdigit() for j in i) and ',' in i:
      lst.append(i)
  print(lst)

这只是一个示例代码，所以没有那么漂亮：）

希望这对你有帮助

解决方案

一,。处理单个脚本

二,。正在处理所有脚本

三,。以操作系统不可知的方式创建路径

四,。代码：自定义函数`process_scripts()`

五,。虚拟数据

相关问题更多 >

编程相关推荐

热门问题

热门文章