A tool for exploratory data analysis.

Detailed description of the edapy Python project



edapy is a first resource for analyzing a new dataset.

Installation

$ pip install git+https://github.com/MartinThoma/edapy.git

For the PDF part, you also need pdftotext:

$ sudo apt-get install poppler-utils
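
Before running the PDF commands, it can help to confirm that the pdftotext binary is actually on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `has_pdftotext` is made up for illustration and is not part of edapy.

```python
import shutil


def has_pdftotext() -> bool:
    """Return True if a pdftotext executable can be found on the PATH."""
    # shutil.which returns the full path of the executable, or None.
    return shutil.which("pdftotext") is not None


if has_pdftotext():
    print("pdftotext found")
else:
    print("pdftotext missing: install poppler-utils first")
```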

Usage

$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...

  edapy is a tool for exploratory data analysis with Python.

  You can use it to get a first idea what a CSV is about or to get an
  overview over a directory of PDF files.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  csv     Analyze CSV files.
  images  Analyze image files.
  pdf     Analyze PDF files.

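The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv for you. This results.csv contains information about the PDF files in the path directory.
  • edapy csv predict --csv_path my-new.csv --types types.yaml starts or continues a process in which the user is guided through a series of questions. In these questions, the user has to decide which delimiter and quotechar to use and which types the columns have.
  • edapy generates a types.yaml file, which other applications can use to load the CSV with df = edapy.load_csv(csv_path, yaml_path).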
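
The delimiter and quotechar questions in the second step can be given a starting guess programmatically. The sketch below uses `csv.Sniffer` from the Python standard library; it is not how edapy itself arrives at its suggestions, and the function name `guess_csv_format` and the sample text are made up for illustration.

```python
import csv


def guess_csv_format(sample: str, candidates: str = ",;\t|") -> dict:
    """Guess delimiter and quotechar from a text sample of a CSV file."""
    # Sniffer.sniff raises csv.Error if it cannot determine a delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return {"delimiter": dialect.delimiter, "quotechar": dialect.quotechar}


# Made-up sample; in practice, read the first few KB of the real file.
sample = "Name;Age;City\nAlice;30;Paris\nBob;25;Berlin\n"
fmt = guess_csv_format(sample)
print(fmt["delimiter"])
```

A guess like this is only a default to offer; the user should still confirm it, which is what the interactive questions are for.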

Example types.yaml

For the Titanic Dataset, the resulting types.yaml looks as follows:

columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','

An example run then looks like this:

$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'

## Integer Columns
Column name: Non-nan  mean   std   min   25%   50%   75%   max
PassengerId:     891  446.00  257.35     1   224   446   668   891
Survived   :     891  0.38  0.49     0     0     0     1     1
Pclass     :     891  2.31  0.84     1     2     3     3     3
SibSp      :     891  0.52  1.10     0     0     0     1     8
Parch      :     891  0.38  0.81     0     0     0     0     6

## Float Columns
Column name: Non-nan   mean    std    min    25%    50%    75%    max
Age        :     714  29.70  14.53   0.42  20.12  28.00  38.00  80.00
Fare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33

## Other Columns
Column name: Non-nan   unique   top (count)
Name       :     891      891   Goldschmidt, Mr. George B (1)
Sex        :     891        2   male (577)
Ticket     :     891      681   347082 (7)
Cabin      :     204      148   C23 C25 C27 (4)
Embarked   :     889        4   S (644)
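
The warnings and tables in this output can be approximated in a few lines, which may help when reading them. The sketch below is an illustration using only the standard library: the threshold `max_distinct=10` is a guess (the source does not state the value edapy uses), and `category_hint` and `numeric_summary` are hypothetical helper names.

```python
import statistics


def category_hint(values, max_distinct=10):
    """Return the distinct values if there are few enough to suggest a category.

    max_distinct is an assumed threshold, not edapy's documented one.
    """
    distinct = []
    for v in values:
        if v is not None and v not in distinct:
            distinct.append(v)  # keep first-seen order
    return distinct if len(distinct) <= max_distinct else None


def numeric_summary(values):
    """Compute count, mean, sample std, min, quartiles and max of a column."""
    data = sorted(v for v in values if v is not None)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "non_nan": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": data[-1],
    }


survived = [0, 1, 1, 1, 0, None]  # made-up sample, not the real column
print(category_hint(survived))
print(numeric_summary(survived))
```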
