CSV file grows from 316 KB to ~3 GB while the Django application is running, then shrinks back immediately after the application terminates

Posted 2024-05-23 23:33:59


The explanation is a bit long, but most of it is background for my question (which may not be Python-specific, but extra information can't hurt):

I am currently developing a Django application. The application window (browser) contains two iFrames, each taking up 50% of the screen. The left side displays a Snopes (fact-checking site) page, and the right side displays one of the pages linked from that particular Snopes article.

A form at the bottom of the application lets the user choose and submit whether the RHS page is the source of the claim in the Snopes article (there are also "invalid input" and "I don't know" options).

Submitting calls a function that tries to fetch another link for the current Snopes page; if there are none left, it fetches a page that has been annotated exactly twice, then exactly once, then the least-annotated page (i.e. counts 0, 3, 4, ...), in that order of priority. This is done using count.csv, which simply stores how many times each page + link combination has been annotated (since Snopes articles can repeat, and linked sites can repeat too).

count.csv has the header:

page source_url count
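The "twice, then once, then least-annotated" priority described above can be sketched in pure Python (the row dicts below are hypothetical stand-ins for count.csv rows, not the app's actual data):

```python
def pick_next_page(rows):
    """Pick the next page to annotate from count.csv-style rows.

    Prefers a page annotated exactly twice, then exactly once,
    then falls back to the least-annotated page overall.
    """
    for target in (2, 1):
        for row in rows:
            if row["count"] == target:
                return row["page"]
    # Fall back to the minimum count (0, 3, 4, ...)
    return min(rows, key=lambda r: r["count"])["page"]

rows = [
    {"page": "is-water-wet", "source_url": "a.example", "count": 0},
    {"page": "moon-cheese", "source_url": "b.example", "count": 1},
    {"page": "flat-earth", "source_url": "c.example", "count": 2},
]
print(pick_next_page(rows))  # "flat-earth" (count == 2 wins)
```

The same priority appears below in `get_least_annotated_page`, expressed as pandas filters on the `count` column.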

The pages displayed on either side are retrieved from a CSV with the following header:

page claim verdict tags date author source_list source_url

User input is stored in a separate CSV per user in the results directory, with the header:

page claim verdict tags date author source_list source_url value name

The values are 1 (yes), 2 (no), 3 (invalid input), 4 (don't know).

The HTML of all links in the first CSV (called samples.csv) is retrieved in advance and stored using the article name as the directory name. The page itself is stored as "page.html" and each source as "some_number.html", where some_number is the index of the source in the source list.

For example, the HTML of the first link in the Snopes article named "is-water-wet" would be stored at:

Annotator/annotator/data/html_snopes/is-water-wet/0.html
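Assuming the layout above, the path for source number `src_idx` of a given article URL can be built from the last URL segment; this is a sketch (the base-directory constant and helper name are assumptions, not the app's actual code):

```python
import os

# Assumed base directory, matching the example path above
SNOPES_PATH = "Annotator/annotator/data/html_snopes/"

def source_html_path(page_url, src_idx):
    """Build the on-disk path for one saved source page.

    The directory name is the last segment of the article URL,
    and the filename is the source's index in the source list.
    """
    article = page_url.strip("/").split("/")[-1]
    return os.path.join(SNOPES_PATH, article, f"{src_idx}.html")

print(source_html_path("https://www.snopes.com/fact-check/is-water-wet/", 0))
# Annotator/annotator/data/html_snopes/is-water-wet/0.html
```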

manage.py is located in the Annotator directory.

After fetching a row from samples (a DataFrame created from samples.csv), my Django app fetches all rows with the same page, and automatically annotates rows that have no corresponding path as 3 (invalid input), since that means the HTML retrieval failed.
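That "mark rows with no saved HTML as invalid" step amounts to an existence check per source index. A minimal sketch under the directory layout described above (function name assumed; the demo writes into a temporary directory rather than the real one):

```python
import os
import tempfile

def missing_source_indexes(article_dir, source_list):
    """Return indexes in source_list whose HTML file was never saved.

    Rows for these sources would be auto-annotated with value 3
    (invalid input), since the missing file means retrieval failed.
    """
    return [
        i for i in range(len(source_list))
        if not os.path.exists(os.path.join(article_dir, f"{i}.html"))
    ]

# Demo: save HTML only for source 0, so source 1 counts as broken
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "0.html"), "w").close()
    missing = missing_source_indexes(d, ["a.example", "b.example"])
print(missing)  # [1]
```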

I noticed a major problem when running the application on a virtual machine. When I log in as a user (to the application) and add an annotation, for some reason the corresponding results CSV grows from 316 KB to ~3 GB, and shrinks back after the application terminates, even though the CSV only has about 248 rows.

I checked the first few rows (of the results CSV) and they look completely normal.
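One way to check whether such an intermittent ~3 GB file holds real bytes or is sparse/NUL-padded is to compare the apparent size with the blocks actually allocated on disk; a diagnostic sketch (POSIX-only, since `st_blocks` counts 512-byte units):

```python
import os
import tempfile

def on_disk_usage(path):
    """Return (apparent size, bytes actually allocated on disk).

    A huge apparent size with few allocated blocks means the file
    is sparse or padding rather than full of real data.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks is in 512-byte units

# Demo on a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1000)
    tmp = f.name
apparent, allocated = on_disk_usage(tmp)
os.remove(tmp)
print(apparent)  # 1000
```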

The code is as follows:

import os
import ast
import codecs
import pandas as pd
from bs4 import BeautifulSoup as bs

# results_path, count_path, samples_path, snopes_path, and res_header
# are module-level globals defined elsewhere in the app
def get_done_by_annotator(name):
    # creates a list of pages that have been already annotated by the current annotator
    results_filename = results_path+name+".csv"
    if os.path.exists(results_filename):
        results = pd.read_csv(results_filename, sep=',', encoding="latin1")
        done_by_annotator = (results["page"]+results["source_url"]).unique()
    else:
        done_by_annotator = []
    return done_by_annotator

def get_count_file(s_p):
    #Creates or reads countfile:
    if os.path.exists(count_path):
        count_file = pd.read_csv(count_path, sep=',', encoding="latin1").sample(frac=1)
    else:
        count_file = s_p[['page','source_url']].copy()
        count_file['count'] = 0
        count_file.to_csv(count_path, sep=',', index=False)
    return count_file

def increase_page_annotation_count(page, origin):
    count_file = pd.read_csv(count_path, sep=',', encoding="latin1")
    count_file.loc[(count_file['page'] == page) & (count_file['source_url'] == origin), 'count'] += 1
    count_file.to_csv(count_path, sep=',', index=False)

def save_annotation(page, origin, value, name):
    # Read samples file
    print("SAVING ANNOTATION")
    s_p = pd.read_csv(samples_path, sep='\t', encoding="latin1")
    entry = s_p.loc[(s_p["page"] == page) & (s_p["source_url"] == origin)]
    if not (entry.empty):
        n_entry = entry.values.tolist()[0]
        n_entry.extend([value, name])
        results_filename = results_path+name+".csv"
        if os.path.exists(results_filename):
            results = pd.read_csv(results_filename, sep=',', encoding="latin1")
        else:
            results = pd.DataFrame(columns=res_header)
        oldEntry = results.loc[(results["page"] == page) & (results["source_url"] == origin)]
        if oldEntry.empty:
            results.loc[len(results)] = n_entry
        results.to_csv(results_filename, sep=',', index=False)
        # keeps track of how many times page was annotated
        increase_page_annotation_count(page, origin)

def get_least_annotated_page(name,aPage=None):
    done_by_annotator = get_done_by_annotator(name)

    #Print number of annotated pages and total number of pages
    s_p = pd.read_csv(samples_path, sep='\t', encoding="latin1")
    print("done: ", len(done_by_annotator), " | total: ", len(s_p))

    if len(done_by_annotator) == len(s_p):
        return "Last annotation done! Thank you!", None, None, None, None, None, None, None

    #Creates or reads countfile:
    count_file = get_count_file(s_p)

    #Get pages not done by current annotator
    not_done_count = count_file.loc[~(count_file['page']+count_file['source_url']).isin(done_by_annotator)]


    print(">>",aPage)
    if aPage is not None:
        remOrigins = not_done_count.loc[not_done_count['page'] == aPage]
        if len(remOrigins)==0:
            return get_least_annotated_page(name)
    else:
        twice_annotated = not_done_count.loc[not_done_count['count'] == 2]
        if len(twice_annotated) > 0:
            page = twice_annotated.iloc[0]['page']
        else:    
            once_annotated = not_done_count.loc[not_done_count['count'] == 1]
            if len(once_annotated) > 0:
                page = once_annotated.iloc[0]['page']
            else:
                index = not_done_count['count'].idxmin(axis=0, skipna=True)
                page = not_done_count.loc[index]['page']
        remOrigins = not_done_count.loc[not_done_count['page'] == page]

    page = remOrigins.iloc[0].page
    #Automatically annotate broken links of this page as invalid input (op = 3)
    src_lst = s_p.loc[s_p['page'] == page]
    src_lst = ast.literal_eval(src_lst.iloc[0].source_list)
    for idx, e in remOrigins.iterrows():
        src_idx_num = src_lst.index(e.source_url)
        if not (os.path.exists(snopes_path+(e.page.strip("/").split("/")[-1]+"/")+str(src_idx_num)+".html")):
            save_annotation(e.page, e.source_url, "3", name)

    #Update done_by_annotator, count_file, and not_done_count
    done_by_annotator = get_done_by_annotator(name)
    count_file = get_count_file(s_p)
    not_done_count = count_file.loc[~(count_file['page']+count_file['source_url']).isin(done_by_annotator)]

    remOrigins = not_done_count.loc[not_done_count['page'] == page]
    if len(remOrigins)==0:
        return get_least_annotated_page(name)

    entry = remOrigins.iloc[0]
    entry = s_p[(s_p.page.isin([entry.page]) & s_p.source_url.isin([entry.source_url]))].iloc[0]
    a_page = entry.page.strip()
    o_page = entry.source_url.strip()
    src_lst = entry.source_list.strip()

    a_page_path = a_page.strip("/").split("/")[-1]+"/"
    src_idx_num = src_lst.index(o_page)
    o_page_path = a_page_path+str(src_idx_num)+".html"

    f = codecs.open(snopes_path+a_page_path+"page.html", encoding='utf-8')
    a_html = bs(f.read(),"lxml")
    f = codecs.open(snopes_path+o_page_path, encoding='utf-8')
    o_html = bs(f.read(),"lxml")

    return a_page, o_page, str(a_html), str(o_html), src_lst, a_done, a_total, len(done_by_annotator)
