The explanation is a bit long, but most of it is background for my question (which may not be Python-specific, but extra information can't hurt):
I'm currently developing a Django application. The application window (browser) has two iFrames, each taking up 50% of the screen. The left side displays a snopes (fact-checking site) page, and the right side displays one of the pages linked from that particular snopes article.
A form at the bottom of the application lets the user choose and submit whether the RHS page is a source for the claim in the snopes article (there are also "invalid input" and "I don't know" options).
Submitting calls a function that tries to fetch another link for the current snopes page; if there are none left, it fetches a page that has been annotated exactly twice, failing that exactly once, failing that any remaining page (count 0, 3, 4, ...), in that order of priority. This is done using count.csv, which simply stores how many times each page + link combination has been annotated (since snopes articles can repeat, and the linked sites can repeat too).
count.csv has the header:
page source_url count
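The selection priority described above (count == 2 first, then count == 1, then the lowest count overall) could be sketched with pandas roughly like this; the column names come from count.csv, but the function name is mine, not part of the app:

```python
import pandas as pd

def pick_next_page(not_done):
    """Pick the next page to annotate from the rows the current annotator
    hasn't done yet: prefer a page annotated exactly twice, then exactly
    once, then fall back to the page with the lowest count."""
    for target in (2, 1):
        hits = not_done.loc[not_done["count"] == target]
        if len(hits) > 0:
            return hits.iloc[0]["page"]
    return not_done.loc[not_done["count"].idxmin()]["page"]
```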
The pages to be displayed on either side are retrieved from a csv with the following header:
page claim verdict tags date author source_list source_url
User input is stored in a separate csv per user in the results directory, with the header:
page claim verdict tags date author source_list source_url value name
The value is 1 (yes), 2 (no), 3 (invalid input), or 4 (don't know).
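For reference, those four codes can be captured in a small lookup; the helper name and structure are mine, only the code-to-meaning pairs come from the app:

```python
# Hypothetical helper mapping the annotation codes stored in the results
# csv to human-readable labels.
VALUE_LABELS = {
    1: "yes",
    2: "no",
    3: "invalid input",
    4: "don't know",
}

def label_for(value):
    """Return the label for an annotation value; accepts ints or the
    string form the csv may contain."""
    return VALUE_LABELS[int(value)]
```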
The html of all links in the first csv (called samples.csv) is retrieved in advance and stored using the article name as the directory name. The page itself is stored as "page.html", and each source is stored as "some_number.html", where some_number is the index of that source in the source list.
For example, the html of the first link in the snopes article named "is-water-wet" would be stored at
Annotator/annotator/data/html_snopes/is-water-wet/0.html
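That storage scheme can be sketched as a small path builder; this is a minimal sketch for illustration, and the function name and argument layout are assumptions, not code from the app:

```python
import os

def html_path(snopes_path, page_url, source_list, source_url=None):
    """Build the path where a page's html is cached.

    The directory name is the last segment of the article URL; the article
    itself is stored as page.html, and each linked source as <index>.html,
    where <index> is its position in source_list.
    """
    article_dir = page_url.strip("/").split("/")[-1]
    if source_url is None:
        filename = "page.html"
    else:
        filename = str(source_list.index(source_url)) + ".html"
    return os.path.join(snopes_path, article_dir, filename)
```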
manage.py is located in Annotator.
After getting a row from samples (a dataframe created from samples.csv), my Django app fetches all rows with the same page and automatically annotates any row without a corresponding path as 3 (invalid input), since that means html retrieval failed.
I noticed a major problem when running the application on a virtual machine. When I log in as a user (into the app) and add an annotation, for some reason the corresponding results csv grows from 316kb to ~3gb, then returns to normal after the application terminates, even though the csv only has about 248 rows.
I checked the first few rows (of the results csv) and they look perfectly normal.
The code is as follows:
import os
import ast
import codecs
import pandas as pd
from bs4 import BeautifulSoup as bs

# results_path, count_path, samples_path, snopes_path and res_header are
# module-level settings (not shown here)

def get_done_by_annotator(name):
    # Creates a list of pages that have already been annotated by the current annotator
    results_filename = results_path + name + ".csv"
    if os.path.exists(results_filename):
        results = pd.read_csv(results_filename, sep=',', encoding="latin1")
        done_by_annotator = (results["page"] + results["source_url"]).unique()
    else:
        done_by_annotator = []
    return done_by_annotator
def get_count_file(s_p):
    # Creates or reads the count file:
    if os.path.exists(count_path):
        count_file = pd.read_csv(count_path, sep=',', encoding="latin1").sample(frac=1)
    else:
        count_file = s_p[['page', 'source_url']].copy()
        count_file['count'] = 0
        count_file.to_csv(count_path, sep=',', index=False)
    return count_file
def increase_page_annotation_count(page, origin):
    count_file = pd.read_csv(count_path, sep=',', encoding="latin1")
    count_file.loc[(count_file['page'] == page) & (count_file['source_url'] == origin), 'count'] += 1
    count_file.to_csv(count_path, sep=',', index=False)
def save_annotation(page, origin, value, name):
    # Read samples file
    print("SAVING ANNOTATION")
    s_p = pd.read_csv(samples_path, sep='\t', encoding="latin1")
    entry = s_p.loc[(s_p["page"] == page) & (s_p["source_url"] == origin)]
    if not entry.empty:
        n_entry = entry.values.tolist()[0]
        n_entry.extend([value, name])
        results_filename = results_path + name + ".csv"
        if os.path.exists(results_filename):
            results = pd.read_csv(results_filename, sep=',', encoding="latin1")
        else:
            results = pd.DataFrame(columns=res_header)
        oldEntry = results.loc[(results["page"] == page) & (results["source_url"] == origin)]
        if oldEntry.empty:
            results.loc[len(results)] = n_entry
            results.to_csv(results_filename, sep=',', index=False)
        # Keeps track of how many times the page was annotated
        increase_page_annotation_count(page, origin)
def get_least_annotated_page(name, aPage=None):
    done_by_annotator = get_done_by_annotator(name)
    # Print number of annotated pages and total number of pages
    s_p = pd.read_csv(samples_path, sep='\t', encoding="latin1")
    print("done: ", len(done_by_annotator), " | total: ", len(s_p))
    if len(done_by_annotator) == len(s_p):
        return "Last annotation done! Thank you!", None, None, None, None, None, None, None
    # Creates or reads the count file:
    count_file = get_count_file(s_p)
    # Get pages not done by the current annotator
    not_done_count = count_file.loc[~(count_file['page'] + count_file['source_url']).isin(done_by_annotator)]
    print(">>", aPage)
    if aPage is not None:
        remOrigins = not_done_count.loc[not_done_count['page'] == aPage]
        if len(remOrigins) == 0:
            return get_least_annotated_page(name)
    else:
        twice_annotated = not_done_count.loc[not_done_count['count'] == 2]
        if len(twice_annotated) > 0:
            page = twice_annotated.iloc[0]['page']
        else:
            once_annotated = not_done_count.loc[not_done_count['count'] == 1]
            if len(once_annotated) > 0:
                page = once_annotated.iloc[0]['page']
            else:
                index = not_done_count['count'].idxmin(skipna=True)
                page = not_done_count.loc[index]['page']
        remOrigins = not_done_count.loc[not_done_count['page'] == page]
    page = remOrigins.iloc[0].page
    # Automatically annotate broken links of this page as invalid input (op = 3)
    src_lst = s_p.loc[s_p['page'] == page]
    src_lst = ast.literal_eval(src_lst.iloc[0].source_list)
    for idx, e in remOrigins.iterrows():
        src_idx_num = src_lst.index(e.source_url)
        if not os.path.exists(snopes_path + e.page.strip("/").split("/")[-1] + "/" + str(src_idx_num) + ".html"):
            save_annotation(e.page, e.source_url, "3", name)
    # Update done_by_annotator, count_file, and not_done_count
    done_by_annotator = get_done_by_annotator(name)
    count_file = get_count_file(s_p)
    not_done_count = count_file.loc[~(count_file['page'] + count_file['source_url']).isin(done_by_annotator)]
    remOrigins = not_done_count.loc[not_done_count['page'] == page]
    if len(remOrigins) == 0:
        return get_least_annotated_page(name)
    entry = remOrigins.iloc[0]
    entry = s_p[(s_p.page.isin([entry.page]) & s_p.source_url.isin([entry.source_url]))].iloc[0]
    a_page = entry.page.strip()
    o_page = entry.source_url.strip()
    src_lst = entry.source_list.strip()
    a_page_path = a_page.strip("/").split("/")[-1] + "/"
    src_idx_num = src_lst.index(o_page)
    o_page_path = a_page_path + str(src_idx_num) + ".html"
    f = codecs.open(snopes_path + a_page_path + "page.html", encoding='utf-8')
    a_html = bs(f.read(), "lxml")
    f = codecs.open(snopes_path + o_page_path, encoding='utf-8')
    o_html = bs(f.read(), "lxml")
    # a_done and a_total come from code not shown in this snippet
    return a_page, o_page, str(a_html), str(o_html), src_lst, a_done, a_total, len(done_by_annotator)