从1993年起从Edgar下载SEC填充指数

python-edgar的Python项目详细描述


Build StatusPyPIPyPI - LicensePyPI - Python Version

python-edgar

自1993年(1993-qtr1,1993-qtr2…)以来,美国证券交易委员会(SEC)的备案指数分为季度文件。通过使用python-edgar和一些脚本,您可以通过将季度索引文件缝合在一起,轻松地重建自1993年以来所有归档的主索引。然后,主索引文件可以馈送到数据库、pandas数据帧、stata等。

索引文件是一个类似csv的(管道|分隔)文件,包含以下信息:

  • 公司名称(例如TWITTER, INC
  • 公司CIK(例如0001418091
  • 填写日期(例如2013-10-03
  • 填充类型(例如S1
  • 在edgar上填写url(edgar/data/1418091/0001193125-13-390321.txt

下载完索引文件python-edgar后,可以使用csv.csvreaderpandas.read_csv打开索引文件,以使数据以编程方式可用。请记住分隔符是|

python-edgar可以用作从另一个python脚本调用的库,也可以用作独立脚本。

功能

  • 快速:使用multiprocessing并行下载。你拥有的CPU越多,它就越快。
  • 高效:检索压缩的存档文件,而不是10倍大的原始索引文件
  • 作为python项目中的库导入或作为独立脚本运行
  • 与外部0依赖项兼容的Python2&3。

用法

使用python edgar作为库

从pip在virtualenv中安装

pip install python-edgar

呼叫图书馆

importedgaredgar.download_index(download_directory,since_year)

输出

2018-06-23 12:41:46,451 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
2018-06-23 12:41:46,451 - DEBUG - downloading files since 20172018-06-23 12:41:46,451 - INFO - 6 index files to retrieve
2018-06-23 12:41:46,465 - DEBUG - worker count: 42018-06-23 12:41:48,359 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR3.tsv
2018-06-23 12:41:48,611 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR2.tsv
2018-06-23 12:41:48,649 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR4.tsv
2018-06-23 12:41:48,935 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR1.tsv
2018-06-23 12:41:49,750 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR2.tsv
2018-06-23 12:41:50,237 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR1.tsv
2018-06-23 12:41:50,376 - INFO - complete2018-06-23 12:41:50,377 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o

使用python edgar作为独立脚本

  • 以zip格式下载此存储库(“克隆或下载”绿色按钮,>;以zip格式下载)。
  • 在该目录中打开终端并运行python run.py -h。可以为下载的索引文件指定目标目录,如-d edgar-idx(默认为临时目录)和/或指定要用-y 2017(默认为当前年份)生成索引的年份。
 $ python run.py -y 20172018-06-23 12:41:46,451 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
2018-06-23 12:41:46,451 - DEBUG - downloading files since 20172018-06-23 12:41:46,451 - INFO - 6 index files to retrieve
2018-06-23 12:41:46,465 - DEBUG - worker count: 42018-06-23 12:41:48,359 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR3.tsv
2018-06-23 12:41:48,611 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR2.tsv
2018-06-23 12:41:48,649 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR4.tsv
2018-06-23 12:41:48,935 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR1.tsv
2018-06-23 12:41:49,750 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR2.tsv
2018-06-23 12:41:50,237 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR1.tsv
2018-06-23 12:41:50,376 - INFO - complete2018-06-23 12:41:50,377 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o

将季度文件缝合到主文件

python-edgar只做了一件事并且做得很好:获取并清理未压缩的季度索引文件到您的计算机。本着unix理念的精神,使用命令行工具将这些索引文件缝合在一起,并创建主索引文件。

在这个例子中,我们调用了python run.py而没有参数。它将下载1993年以来每个季度的索引文件。

 python run.py -y 19932018-06-23 13:00:16,855 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
2018-06-23 13:00:16,855 - DEBUG - downloading files since 19932018-06-23 13:00:16,856 - INFO - 102 index files to retrieve
2018-06-23 13:00:16,879 - DEBUG - worker count: 42018-06-23 13:00:18,814 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR4.tsv
2018-06-23 13:00:19,026 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR3.tsv
2018-06-23 13:00:19,157 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2018-QTR2.tsv
2018-06-23 13:00:19,543 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2018-QTR1.tsv
2018-06-23 13:00:20,521 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR2.tsv
2018-06-23 13:00:20,719 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR4.tsv
2018-06-23 13:00:21,016 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR3.tsv
2018-06-23 13:00:21,134 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR1.tsv
2018-06-23 13:00:22,099 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR2.tsv
(...)
dcw07x6zrrr0000gn/T/tmpcF1rx7/1993-QTR2.tsv
2018-06-23 13:00:54,378 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/1993/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/1993-QTR1.tsv
2018-06-23 13:00:54,423 - INFO - complete2018-06-23 13:00:54,424 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7

检查下载文件的目录:

$ ls -lh /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
total 4964656
drwx------  104 eswiac  staff   3.3K Jun 2313:00 .
drwxr-xr-x  342 eswiac  staff    11K Jun 2313:01 ..
-rw-r--r--    1 eswiac  staff   585B Jun 2313:00 1993-QTR1.tsv
-rw-r--r--    1 eswiac  staff   580B Jun 2313:00 1993-QTR2.tsv
-rw-r--r--    1 eswiac  staff   1.0K Jun 2313:00 1993-QTR3.tsv
-rw-r--r--    1 eswiac  staff   2.8K Jun 2313:00 1993-QTR4.tsv
-rw-r--r--    1 eswiac  staff   2.9M Jun 2313:00 1994-QTR1.tsv
-rw-r--r--    1 eswiac  staff   2.3M Jun 2313:00 1994-QTR2.tsv
(...)
-rw-r--r--    1 eswiac  staff    27M Jun 2313:00 2017-QTR3.tsv
-rw-r--r--    1 eswiac  staff    27M Jun 2313:00 2017-QTR4.tsv
-rw-r--r--    1 eswiac  staff    41M Jun 2313:00 2018-QTR1.tsv
-rw-r--r--    1 eswiac  staff    31M Jun 2313:00 2018-QTR2.tsv

前往该目录,以便我们可以使用cat

将这些文件合并到主文件中。
$ cd  /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
$ cat *.tsv > master.tsv
$ du -h master.tsv
2.3G	master.tsv

现在你有了这个主索引文件。它不排序,但很容易做到(提示:查看sort命令)

从特定公司获取文件

现在我们已经下载了索引文件,只需一点命令行脚本,就可以很容易地按公司快速筛选,并将url提取到我们想要的grep文件中。在下面的示例中,我们通过cik(1000045)进行grep,将输出存储在一个中间文本文件中,然后使用cat打开该文件,并通过表单10-k再次进行grep。在路径前面加上https://www.sec.gov/Archives/,您将得到完整的url。

eswiac@mbp python-edgar (master) $ grep -h 1000045 /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpvwOzOU/* > 1000045.txt
eswiac@mbp python-edgar (master) $ cat 1000045.txt | grep -h 10-K
1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|edgar/data/1000045/0001193125-15-223218.txt|edgar/data/1000045/0001193125-15-223218-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2016-06-14|edgar/data/1000045/0001193125-16-620952.txt|edgar/data/1000045/0001193125-16-620952-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2017-06-14|edgar/data/1000045/0001193125-17-203193.txt|edgar/data/1000045/0001193125-17-203193-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2018-06-27|edgar/data/1000045/0001193125-18-205637.txt|edgar/data/1000045/0001193125-18-205637-index.html

使用q

查询主索引

https://github.com/harelba/q允许直接在表格数据上运行sql。

小心使用:q不使用索引,因此对主索引运行查询将非常缓慢,因为它相当大。排序主索引或将数据缩小到较小的子集将使搜索更快。最终,您需要将主索引文件加载到能够处理大小的适当数据库中。

您可能需要尝试一些查询

  • q "SELECT COUNT(1) FROM 1999-QTR4.tsv"
  • q -d"|" "SELECT * FROM master.tsv where c1 = 1418091 and c3 = '10-Q' order by c4"

许可证

麻省理工学院

欢迎加入QQ群-->: 979659372 Python中文网_新手群

推荐PyPI第三方库


热门话题
java如何将jaxb插件扩展与gradlejaxbplugin一起使用   java Hibernate列表<Object[]>到特定对象   java使用多态性显示arraylist的输出   java水平堆叠卡,带有一定偏移量   java错误:找不到符号方法liesInt()   java客户机/服务器文件收发中的多线程流管理   在java中可以基于访问重载方法吗?   包含空元素的java排序数组   swing Java按钮/网格布局   java BottomNavigationView getmaxitemcount   java空指针异常字符串生成器   java Xamarin升级导致“类文件版本错误52.0,应为50.0”错误   java我正在尝试打印它,而不只是对每一行进行println   Tomcat7中的java是否需要复制上下文。将xml转换为conf/Catalina/locahost以使其生效   带有注入服务的java REST端点在何处引发自定义WebServiceException?   在Java中使用GPS数据   java如何将JFreeChart ChartPanel导出到包含添加的CrosshairOverlay的图像对象?   内置Eclipse期间的Java 8堆栈溢出   java在GWT编译的JavaScript中如何表示BigDecimal