如何将Html<script>中的标题和值保存为csv文件中的表格?

2024-05-23 17:54:27 发布

您现在位置:Python中文网/ 问答频道 /正文

我不太会写代码。使用slenium和beautifulsoup,我在网页上的几十个脚本中找到了我想要的脚本。我在找剧本[17]。当执行这些代码时,脚本[17]给出如下结果

我的代码的最后一部分


html=driver.page_source
soup=BeautifulSoup(html, "html.parser")
scripts=soup.find_all("script")
x=scripts[17]
print(x)

结果,输出 注意:日期列表在脚本前面[17]。滑动杆。虚拟数据

虚拟数据

<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>


我的目的是将此输出提取到表中,并将其保存为csv文件。如何提取我想要的脚本

所有日期都应位于顶部,所有名称都应位于最右侧,所有值都应介于两者之间

Hisse      2012.12             2013.3             2013.4 ...
ANA      1,114,919,783.60   1,142,792,778.19     1,091,028,645.38  ...
DEN      55,200,000.00      55,920,000.00        45,960,000.00  ....
.
.
.

Tags: 代码脚本truenewvarhtmlscriptarray
2条回答

解决方案

自定义函数process_scripts()将生成您要查找的内容。我使用下面给出的虚拟数据(最后)。首先,我们检查代码是否符合预期,因此我们创建一个数据帧来查看输出

You could also open this Colab Jupyter Notebook and run it on Cloud for free. This will allow you to not worry about any installation or setup and simply focus on examining the solution itself.

Open In Colab

一,。处理单个脚本

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts = [s], 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = False, 
                        verbose = 0)

print(dfs[0].reset_index(drop=True))

输出

  Name           2012.12  ...            2019.12           2020.03
0  ANA  1,114,919,783.60  ...   3,270,000,000.87  2,347,500,000.63
1  DEN     55,200,000.00  ...     125,100,000.00     85,350,000.00
2  SIS  4,425,000,000.00  ...  11,857,500,000.00  9,315,000,000.00
3  TRK  1,692,579,200.00  ...   4,375,000,000.00  3,587,500,000.00

[4 rows x 31 columns]

二,。正在处理所有脚本

您可以使用自定义定义函数process_scripts()处理所有脚本。代码如下所示

## Define CSV file-output folder-path
OUTPUT_PATH = './output'
## Process scripts
dfs = process_scripts(scripts, 
                        output_path = OUTPUT_PATH, 
                        save_to_csv = True, 
                        verbose = 0)

## To clear the output dir-contents
#!rm -f $OUTPUT_PATH/*

我在GoogleColab上做了这件事,它按预期工作

enter image description here

三,。以操作系统不可知的方式创建路径

为基于windows或unix的系统创建路径可能非常不同。下面向您展示了一种实现这一点的方法,而不必担心将运行代码的操作系统。我在这里用了^{} library。不过,我建议你也看看^{} library

# Define relative path for output-folder
OUTPUT_PATH = './output'

# Dynamically define absolute path
pwd = os.getcwd() # present-working-directory
OUTPUT_PATH = os.path.join(pwd, os.path.abspath(OUTPUT_PATH))

四,。代码:自定义函数process_scripts()

在这里,我们使用regex(正则表达式)库和pandas以表格格式组织数据,然后写入csv文件。tqdm库用于在处理多个脚本时为您提供一个漂亮的progressbar。请查看代码中的注释,了解如果不是从jupyter notebook运行它,该怎么办。os库用于路径操作和创建输出目录

#pip install -U pandas
#pip install tqdm
import pandas as pd
import re # regex
import os
from tqdm.notebook import tqdm
# Use the following line if not using a jupyter notebook
# from tqdm import tqdm

def process_scripts(scripts, 
                    output_path='./output', 
                    save_to_csv: bool=False, 
                    verbose: int=0):
    """Process all scripts and return a list of dataframes and 
    optionally write each dataframe to a CSV file.

    Parameters
         
        scripts: list of scripts
        output_path (str): output-folder-path for csv files
        save_to_csv (bool):  default is False
        verbose (int): prints output for verbose>0

    Example
       -
    OUTPUT_PATH = './output'
    dfs = process_scripts(scripts, 
                          output_path = OUTPUT_PATH, 
                          save_to_csv = True, 
                          verbose = 0)

    ## To clear the output dir-contents
    #!rm -f $OUTPUT_PATH/*
    """
    ## Define regex patterns and compile for speed
    pat_header = re.compile(r"theCols\[\d+\] = new Array\s*\([\'](\d{4}\.\d{1,2})[\'],\d+,\d+\)")
    pat_line = re.compile(r"theRows\[\d+\] = new Array\s*\((.*)\).*")
    pat_code = re.compile("([A-Z]{3})")

    # Calculate zfill-digits
    zfill_digits = len(str(len(scripts)))
    print(f'Total scripts: {len(scripts)}')

    # Create output_path
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    # Define a list of dataframes:  
    #   An accumulator of all scripts
    dfs = []

    ## If you do not have tqdm installed, uncomment the 
    #  next line and comment out the following line.
    #for script_num, script in enumerate(scripts):
    for script_num, script in enumerate(tqdm(scripts, desc='Scripts Processed')):
        ## Extract: Headers, Rows
        #    Rows : code (Name: single column), line-data (multi-column)
        headers = script.strip().split('\n', 0)[0]
        headers = ['Name'] + re.findall(pat_header, headers)
        lines = re.findall(pat_line, script)
        codes = [re.findall(pat_code, line)[0] for line in lines]
        # Clean data for each row
        lines_data = dict()
        for line, code in zip(lines, codes):
            line_data = line.replace("','", "|").split('|')
            line_data[-1] = line_data[-1].replace("'", "")
            line_data[0] = code
            lines_data.update({code: line_data.copy()})
        if verbose>0:
            print('{}: {}'.format(script_num, codes))
        ## Load data into a pandas-dataframe
        #  and write to csv.
        df = pd.DataFrame(lines_data).T
        df.columns = headers
        dfs.append(df.copy()) # update list
        # Write to CSV
        if save_to_csv:
            num_label = str(script_num).zfill(zfill_digits)
            script_file_csv = f'Script_{num_label}.csv' 
            script_path = os.path.join(output_path, script_file_csv)   
            df.to_csv(script_path, index=False)
    return dfs

五,。虚拟数据

## Dummy Data
s = """
<script language="JavaScript"> var theHlp='/yardim/matris.asp';var theTitle = 'Piyasa Değeri';var theCaption='Cam (TL)';var lastmod = '';var h='<a class=hisselink href=../Hisse/HisseAnaliz.aspx?HNO=';var e='<a class=hisselink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('Hisse',4,50);theCols[1] = new Array('2012.12',1,60);theCols[2] = new Array('2013.03',1,60);theCols[3] = new Array('2013.06',1,60);theCols[4] = new Array('2013.09',1,60);theCols[5] = new Array('2013.12',1,60);theCols[6] = new Array('2014.03',1,60);theCols[7] = new Array('2014.06',1,60);theCols[8] = new Array('2014.09',1,60);theCols[9] = new Array('2014.12',1,60);theCols[10] = new Array('2015.03',1,60);theCols[11] = new Array('2015.06',1,60);theCols[12] = new Array('2015.09',1,60);theCols[13] = new Array('2015.12',1,60);theCols[14] = new Array('2016.03',1,60);theCols[15] = new Array('2016.06',1,60);theCols[16] = new Array('2016.09',1,60);theCols[17] = new Array('2016.12',1,60);theCols[18] = new Array('2017.03',1,60);theCols[19] = new Array('2017.06',1,60);theCols[20] = new Array('2017.09',1,60);theCols[21] = new Array('2017.12',1,60);theCols[22] = new Array('2018.03',1,60);theCols[23] = new Array('2018.06',1,60);theCols[24] = new Array('2018.09',1,60);theCols[25] = new Array('2018.12',1,60);theCols[26] = new Array('2019.03',1,60);theCols[27] = new Array('2019.06',1,60);theCols[28] = new Array('2019.09',1,60);theCols[29] = new Array('2019.12',1,60);theCols[30] = new Array('2020.03',1,60);var theRows = new Array();
theRows[0] = new Array ('<b>'+h+'30>ANA</B></a>','1,114,919,783.60','1,142,792,778.19','1,091,028,645.38','991,850,000.48','796,800,000.38','697,200,000.34','751,150,000.36','723,720,000.33','888,000,000.40','790,320,000.36','883,560,000.40','927,960,000.42','737,040,000.33','879,120,000.40','914,640,000.41','927,960,000.42','1,172,160,000.53','1,416,360,000.64','1,589,520,000.72','1,552,500,000.41','1,972,500,000.53','2,520,000,000.67','2,160,000,000.58','2,475,000,000.66','2,010,000,000.54','2,250,000,000.60','2,077,500,000.55','2,332,500,000.62','3,270,000,000.87','2,347,500,000.63');
theRows[1] = new Array ('<b>'+h+'89>DEN</B></a>','55,200,000.00','55,920,000.00','45,960,000.00','42,600,000.00','35,760,000.00','39,600,000.00','40,200,000.00','47,700,000.00','50,460,000.00','45,300,000.00','41,760,000.00','59,340,000.00','66,600,000.00','97,020,000.00','81,060,000.00','69,300,000.00','79,800,000.00','68,400,000.00','66,900,000.00','66,960,000.00','71,220,000.00','71,520,000.00','71,880,000.00','60,600,000.00','69,120,000.00','62,640,000.00','57,180,000.00','89,850,000.00','125,100,000.00','85,350,000.00');
theRows[2] = new Array ('<b>'+h+'269>SIS</B></a>','4,425,000,000.00','4,695,000,000.00','4,050,000,000.00','4,367,380,000.00','4,273,120,000.00','3,644,720,000.00','4,681,580,000.00','4,913,000,000.00','6,188,000,000.00','5,457,000,000.00','6,137,000,000.00','5,453,000,000.00','6,061,000,000.00','6,954,000,000.00','6,745,000,000.00','6,519,000,000.00','7,851,500,000.00','8,548,500,000.00','9,430,000,000.00','9,225,000,000.00','10,575,000,000.00','11,610,000,000.00','9,517,500,000.00','13,140,000,000.00','12,757,500,000.00','13,117,500,000.00','11,677,500,000.00','10,507,500,000.00','11,857,500,000.00','9,315,000,000.00');
theRows[3] = new Array ('<b>'+h+'297>TRK</B></a>','1,692,579,200.00','1,983,924,800.00','1,831,315,200.00','1,704,000,000.00','1,803,400,000.00','1,498,100,000.00','1,803,400,000.00','1,884,450,000.00','2,542,160,000.00','2,180,050,000.00','2,069,200,000.00','1,682,600,000.00','1,619,950,000.00','1,852,650,000.00','2,040,600,000.00','2,315,700,000.00','2,641,200,000.00','2,938,800,000.00','3,599,100,000.00','4,101,900,000.00','5,220,600,000.00','5,808,200,000.00','4,689,500,000.00','5,375,000,000.00','3,787,500,000.00','4,150,000,000.00','3,662,500,000.00','3,712,500,000.00','4,375,000,000.00','3,587,500,000.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script>
"""

## Make a dummy list of scripts
scripts = [s for _ in range(10)]

根据您问题中提供的<script>,您可以执行类似于下面代码的操作,以获得每个名称ANA, DEN ..的日期列表:

for _ in range(1, len(aaa.split("<b>'"))-1):
  s = aaa.split("<b>'")[_].split("'")
  print(_)
  lst = []
  for i in s:
    if "</B>" in i:
      name = i.split('>')[1].split("<")[0]
      print("{} = ".format(name), end="")
    if any(j.isdigit() for j in i) and ',' in i:
      lst.append(i)
  print(lst)

这只是一个示例代码,所以没有那么漂亮:)

希望这对你有帮助

相关问题 更多 >