python-edgar
Download the SEC filing index from EDGAR, from 1993 onward
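The SEC filing index has been split into quarterly files since 1993 (1993-QTR1, 1993-QTR2, ...). With python-edgar and a little scripting, you can rebuild a master index of all filings since 1993 by stitching the quarterly index files together. The master index file can then be fed into a database, a pandas DataFrame, Stata, and so on.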
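Each index file is a CSV-like file that uses the pipe character (|) as its delimiter and contains the following information:
- Company name (e.g. TWITTER, INC)
- Company CIK (e.g. 0001418091)
- Filing date (e.g. 2013-10-03)
- Filing type (e.g. S-1)
- Filing URL on EDGAR (e.g. edgar/data/1418091/0001193125-13-390321.txt)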
Once python-edgar has downloaded the index files, you can open them with csv.reader or pandas.read_csv to make the data available programmatically. Remember that the delimiter is |!
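As a minimal sketch of the csv.reader route, the snippet below parses pipe-delimited index lines into dictionaries. The column names and their order are assumptions inferred from the sample rows shown later in this document, the helper name read_index is hypothetical, and the sketch assumes the files carry no header row.

```python
import csv
import io

# Assumed column layout, inferred from the sample rows in this document.
COLUMNS = ["cik", "company_name", "form_type", "date_filed", "txt_path", "html_path"]


def read_index(lines):
    """Yield one dict per filing from an iterable of pipe-delimited lines."""
    # QUOTE_NONE keeps double quotes inside company names as literal characters.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(COLUMNS, row))


sample = io.StringIO(
    "1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|"
    "edgar/data/1000045/0001193125-15-223218.txt|"
    "edgar/data/1000045/0001193125-15-223218-index.html\n"
)
filings = list(read_index(sample))
print(filings[0]["company_name"], filings[0]["date_filed"])
```

To read a real file, pass an open file object instead of the StringIO sample, for example open("2018-QTR2.tsv", newline="").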
python-edgar can be used as a library called from another Python script, or as a standalone script.
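Features
- Fast: downloads run in parallel using multiprocessing, so the more CPUs you have, the faster it goes.
- Efficient: retrieves the compressed archive files rather than the raw index files, which are about 10 times larger.
- Import it as a library in a Python project or run it as a standalone script.
- Compatible with Python 2 and 3, with no external dependencies.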
Usage
Using python-edgar as a library
Install from pip in a virtualenv:
pip install python-edgar
Call the library:
import edgar
edgar.download_index(download_directory, since_year)
Output
2018-06-23 12:41:46,451 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
2018-06-23 12:41:46,451 - DEBUG - downloading files since 2017
2018-06-23 12:41:46,451 - INFO - 6 index files to retrieve
2018-06-23 12:41:46,465 - DEBUG - worker count: 4
2018-06-23 12:41:48,359 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR3.tsv
2018-06-23 12:41:48,611 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR2.tsv
2018-06-23 12:41:48,649 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR4.tsv
2018-06-23 12:41:48,935 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR1.tsv
2018-06-23 12:41:49,750 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR2.tsv
2018-06-23 12:41:50,237 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR1.tsv
2018-06-23 12:41:50,376 - INFO - complete
2018-06-23 12:41:50,377 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
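The log above reports 6 index files for a 2017 start on 2018-06-23. As a rough sketch of where that number comes from, the quarterly URLs can be enumerated from the URL pattern visible in the log. This is not python-edgar's own implementation; the function name quarterly_index_urls and its today parameter are hypothetical and exist only for illustration.

```python
from datetime import date

# URL pattern copied from the download log above.
BASE_URL = "https://www.sec.gov/Archives/edgar/full-index"


def quarterly_index_urls(since_year, today=None):
    """List one master.zip URL per quarter from since_year up to the current quarter."""
    today = today or date.today()
    current_quarter = (today.month - 1) // 3 + 1
    urls = []
    for year in range(since_year, today.year + 1):
        # Past years have four quarters; the current year stops at the current quarter.
        last_quarter = current_quarter if year == today.year else 4
        for quarter in range(1, last_quarter + 1):
            urls.append(f"{BASE_URL}/{year}/QTR{quarter}/master.zip")
    return urls


urls = quarterly_index_urls(2017, today=date(2018, 6, 23))
print(len(urls))  # 6, matching "6 index files to retrieve" in the log
```

With since_year=1993 and the same date, the function returns 102 URLs, which agrees with the count logged for the full download.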
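Using python-edgar as a standalone script
- Download this repository as a zip file (green "Clone or download" button > Download ZIP).
- Open a terminal in that directory and run python run.py -h. You can specify a destination directory for the downloaded index files with, for example, -d edgar-idx (defaults to a temporary directory), and/or the year from which to start building the index with, for example, -y 2017 (defaults to the current year).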
$ python run.py -y 2017
2018-06-23 12:41:46,451 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
2018-06-23 12:41:46,451 - DEBUG - downloading files since 2017
2018-06-23 12:41:46,451 - INFO - 6 index files to retrieve
2018-06-23 12:41:46,465 - DEBUG - worker count: 4
2018-06-23 12:41:48,359 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR3.tsv
2018-06-23 12:41:48,611 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR2.tsv
2018-06-23 12:41:48,649 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR4.tsv
2018-06-23 12:41:48,935 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2018-QTR1.tsv
2018-06-23 12:41:49,750 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR2.tsv
2018-06-23 12:41:50,237 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o/2017-QTR1.tsv
2018-06-23 12:41:50,376 - INFO - complete
2018-06-23 12:41:50,377 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpr2Nk3o
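Stitching the quarterly files into a master file
python-edgar does one thing and does it well: it gets the quarterly index files onto your computer, uncompressed and cleaned. In the spirit of the Unix philosophy, use command-line tools to stitch these index files together and create the master index file.
In this example, we call python run.py -y 1993, which downloads the index file for every quarter since 1993.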
$ python run.py -y 1993
2018-06-23 13:00:16,855 - DEBUG - downloads will be saved to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
2018-06-23 13:00:16,855 - DEBUG - downloading files since 1993
2018-06-23 13:00:16,856 - INFO - 102 index files to retrieve
2018-06-23 13:00:16,879 - DEBUG - worker count: 4
2018-06-23 13:00:18,814 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR4.tsv
2018-06-23 13:00:19,026 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR3.tsv
2018-06-23 13:00:19,157 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2018-QTR2.tsv
2018-06-23 13:00:19,543 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2018-QTR1.tsv
2018-06-23 13:00:20,521 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR2.tsv
2018-06-23 13:00:20,719 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR4/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR4.tsv
2018-06-23 13:00:21,016 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR3/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR3.tsv
2018-06-23 13:00:21,134 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2017-QTR1.tsv
2018-06-23 13:00:22,099 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/2016/QTR2/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/2016-QTR2.tsv
(...) dcw07x6zrrr0000gn/T/tmpcF1rx7/1993-QTR2.tsv
2018-06-23 13:00:54,378 - INFO - > downloaded https://www.sec.gov/Archives/edgar/full-index/1993/QTR1/master.zip to /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7/1993-QTR1.tsv
2018-06-23 13:00:54,423 - INFO - complete
2018-06-23 13:00:54,424 - INFO - Files downloaded in /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
Check the directory that contains the downloaded files:
$ ls -lh /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
total 4964656
drwx------  104 eswiac  staff  3.3K Jun 23 13:00 .
drwxr-xr-x  342 eswiac  staff   11K Jun 23 13:01 ..
-rw-r--r--    1 eswiac  staff  585B Jun 23 13:00 1993-QTR1.tsv
-rw-r--r--    1 eswiac  staff  580B Jun 23 13:00 1993-QTR2.tsv
-rw-r--r--    1 eswiac  staff  1.0K Jun 23 13:00 1993-QTR3.tsv
-rw-r--r--    1 eswiac  staff  2.8K Jun 23 13:00 1993-QTR4.tsv
-rw-r--r--    1 eswiac  staff  2.9M Jun 23 13:00 1994-QTR1.tsv
-rw-r--r--    1 eswiac  staff  2.3M Jun 23 13:00 1994-QTR2.tsv
(...)
-rw-r--r--    1 eswiac  staff   27M Jun 23 13:00 2017-QTR3.tsv
-rw-r--r--    1 eswiac  staff   27M Jun 23 13:00 2017-QTR4.tsv
-rw-r--r--    1 eswiac  staff   41M Jun 23 13:00 2018-QTR1.tsv
-rw-r--r--    1 eswiac  staff   31M Jun 23 13:00 2018-QTR2.tsv
Change to that directory so that we can use cat:
$ cd /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpcF1rx7
$ cat *.tsv > master.tsv
$ du -h master.tsv
2.3G master.tsv
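If you would rather stay in Python than use cat, the same stitching can be sketched with the standard library. This is one possible approach rather than anything python-edgar provides; the function name stitch and its output_name parameter are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def stitch(index_dir, output_name="master.tsv"):
    """Concatenate every quarterly .tsv file in index_dir into one master file."""
    output_path = os.path.join(index_dir, output_name)
    # Sorted file names give chronological order (1993-QTR1, 1993-QTR2, ...).
    # The master file itself is excluded so that re-running does not duplicate data.
    parts = sorted(
        path
        for path in glob.glob(os.path.join(index_dir, "*.tsv"))
        if os.path.basename(path) != output_name
    )
    with open(output_path, "wb") as master:
        for part in parts:
            with open(part, "rb") as quarterly:
                # Stream in chunks so large files never sit fully in memory.
                shutil.copyfileobj(quarterly, master)
    return output_path


# Small demonstration with two placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, row in [("1993-QTR1.tsv", "a|1\n"), ("1993-QTR2.tsv", "b|2\n")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(row)
    with open(stitch(tmp)) as handle:
        stitched = handle.read()
print(stitched)
```

Copying in binary mode avoids any assumption about the text encoding of the index files.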
You now have the master index file. It is not sorted, but sorting it is easy (hint: look at the sort command).
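For a subset that fits in memory, sorting can also be sketched in Python. The example assumes the filing date is the fourth pipe-delimited field, as in the sample rows shown in this document; the helper names filing_date and sort_by_date are hypothetical. For the full 2.3G master file, the sort command is likely the better tool because this sketch loads every line into memory.

```python
def filing_date(line, date_column=3):
    """Return the filing date of one index line, or "" if the column is missing."""
    fields = line.rstrip("\n").split("|")
    return fields[date_column] if len(fields) > date_column else ""


def sort_by_date(lines):
    """Return the lines ordered by filing date."""
    # ISO dates (YYYY-MM-DD) sort chronologically when compared as plain text.
    return sorted(lines, key=filing_date)


rows = [
    "1000045|NICHOLAS FINANCIAL INC|10-K|2017-06-14|edgar/data/1000045/0001193125-17-203193.txt|edgar/data/1000045/0001193125-17-203193-index.html",
    "1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|edgar/data/1000045/0001193125-15-223218.txt|edgar/data/1000045/0001193125-15-223218-index.html",
    "1000045|NICHOLAS FINANCIAL INC|10-K|2016-06-14|edgar/data/1000045/0001193125-16-620952.txt|edgar/data/1000045/0001193125-16-620952-index.html",
]
ordered = sort_by_date(rows)
print([filing_date(row) for row in ordered])
```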
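Getting the filings of a specific company
Now that the index files are downloaded, a little command-line scripting makes it easy to filter by company and extract the URLs of the filings we want with grep. In the example below, we grep for the CIK (1000045), store the output in an intermediate text file, then open that file with cat and grep again for form 10-K. Prefix each path with https://www.sec.gov/Archives/ to get the full URL.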
eswiac@mbp python-edgar (master) $ grep -h 1000045 /var/folders/bv/2zbdkyyj14766dcw07x6zrrr0000gn/T/tmpvwOzOU/* > 1000045.txt
eswiac@mbp python-edgar (master) $ cat 1000045.txt | grep -h 10-K
1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|edgar/data/1000045/0001193125-15-223218.txt|edgar/data/1000045/0001193125-15-223218-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2016-06-14|edgar/data/1000045/0001193125-16-620952.txt|edgar/data/1000045/0001193125-16-620952-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2017-06-14|edgar/data/1000045/0001193125-17-203193.txt|edgar/data/1000045/0001193125-17-203193-index.html
1000045|NICHOLAS FINANCIAL INC|10-K|2018-06-27|edgar/data/1000045/0001193125-18-205637.txt|edgar/data/1000045/0001193125-18-205637-index.html
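The same filter can be sketched in Python, including the URL prefix step. The column positions (CIK first, form type third, .txt path fifth) are assumptions read off the grep output above, and the function name filing_urls is hypothetical.

```python
ARCHIVES_URL = "https://www.sec.gov/Archives/"


def filing_urls(lines, cik, form_type):
    """Return full URLs of the filings that match the CIK and form type exactly."""
    urls = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # Assumed layout: fields[0] = CIK, fields[2] = form type, fields[4] = .txt path.
        if fields[0] == str(cik) and fields[2] == form_type:
            urls.append(ARCHIVES_URL + fields[4])
    return urls


index_lines = [
    "1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|edgar/data/1000045/0001193125-15-223218.txt|edgar/data/1000045/0001193125-15-223218-index.html",
    "1000045|NICHOLAS FINANCIAL INC|10-K|2016-06-14|edgar/data/1000045/0001193125-16-620952.txt|edgar/data/1000045/0001193125-16-620952-index.html",
]
for url in filing_urls(index_lines, 1000045, "10-K"):
    print(url)
```

Unlike grep, which matches the pattern anywhere in the line, this compares whole fields, so a CIK that happens to appear inside another field, or a form type that merely starts with 10-K, is not picked up by accident.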
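Querying the master index with q
q (https://github.com/harelba/q) lets you run SQL directly on tabular data.
Use it with care: q does not use indexes, so running queries against the master index will be very slow because the file is quite large. Sorting the master index or narrowing the data down to a smaller subset will make searches faster. Eventually, you will want to load the master index file into a proper database that can handle its size.
Some queries you may want to try: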
q "SELECT COUNT(1) FROM 1999-QTR4.tsv"
q -d"|" "SELECT * FROM master.tsv where c1 = 1418091 and c3 = '10-Q' order by c4"
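As one way to move toward a proper database, the sketch below loads index lines into SQLite using Python's built-in sqlite3 module and adds an index on the CIK column. The table name, column names, and the helper load_index are all hypothetical, and the six-column layout is assumed from the sample rows in this document.

```python
import sqlite3


def load_index(conn, lines):
    """Insert pipe-delimited index lines into a 'filings' table indexed by CIK."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS filings ("
        "cik INTEGER, company_name TEXT, form_type TEXT, "
        "date_filed TEXT, txt_path TEXT, html_path TEXT)"
    )
    rows = (line.rstrip("\n").split("|") for line in lines)
    conn.executemany(
        "INSERT INTO filings VALUES (?, ?, ?, ?, ?, ?)",
        (row for row in rows if len(row) == 6),  # skip malformed lines
    )
    # The index is what makes per-company lookups fast on a large table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_filings_cik ON filings (cik)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to persist the database
load_index(conn, [
    "1000045|NICHOLAS FINANCIAL INC|10-K|2016-06-14|edgar/data/1000045/0001193125-16-620952.txt|edgar/data/1000045/0001193125-16-620952-index.html",
    "1000045|NICHOLAS FINANCIAL INC|10-K|2015-06-15|edgar/data/1000045/0001193125-15-223218.txt|edgar/data/1000045/0001193125-15-223218-index.html",
])
dates = conn.execute(
    "SELECT date_filed FROM filings "
    "WHERE cik = ? AND form_type = ? ORDER BY date_filed",
    (1000045, "10-K"),
).fetchall()
print(dates)
```

The SELECT mirrors the second q query above, with named columns in place of c1, c3, and c4.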
License
MIT