A tool for exploratory data analysis.

Project description

edapy is a first resource to turn to when analyzing a new dataset.

Installation

$ pip install git+https://github.com/MartinThoma/edapy.git

For the PDF part, you also need pdftotext:

$ sudo apt-get install poppler-utils

Usage

$ edapy --help
Usage: edapy [OPTIONS] COMMAND [ARGS]...

  edapy is a tool for exploratory data analysis with Python.

  You can use it to get a first idea what a CSV is about or to get an
  overview over a directory of PDF files.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  csv     Analyze CSV files.
  images  Analyze image files.
  pdf     Analyze PDF files.

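The workflow is as follows:

  • edapy pdf find --path . --output results.csv creates a results.csv file. This file contains the metadata of all PDF files in the path directory.
  • edapy csv predict --csv_path my-new.csv --types types.yaml starts or resumes a process that guides the user through a series of questions. In these questions, the user decides which delimiter and quotechar to use and what type each column has.
  • edapy generates a types.yaml file, which other applications can use to load the CSV file with df = edapy.load_csv(csv_path, yaml_path).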
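
The decisions made in the predict step (delimiter, quotechar, column types) can be illustrated with a minimal sketch that uses only the Python standard library. This is not edapy's implementation: the function names sniff_csv_format and guess_dtype are hypothetical, and the rule "try int, then float, else other" is an assumption chosen to match the three dtypes that appear in types.yaml.

```python
import csv


def sniff_csv_format(sample):
    """Guess (delimiter, quotechar) from a text sample of a CSV file."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter, dialect.quotechar


def guess_dtype(values):
    """Return 'int', 'float' or 'other' for a column of raw strings."""
    # Empty cells are treated as missing and do not influence the guess.
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "other"
    for name, caster in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "other"


sample = (
    "PassengerId,Name,Survived\n"
    '1,"Braund, Mr. Owen",0\n'
    '2,"Cumings, Mrs. John",1\n'
)
delimiter, quotechar = sniff_csv_format(sample)
```

A guess like this would only be a starting point; the interactive questions exist precisely because such heuristics can be wrong, for example for numeric IDs that should be treated as text.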

Example types.yaml

For the Titanic dataset, the generated types.yaml looks like this:

columns:
- dtype: other
  name: Name
- dtype: int
  name: Parch
- dtype: float
  name: Age
- dtype: other
  name: Ticket
- dtype: float
  name: Fare
- dtype: int
  name: PassengerId
- dtype: other
  name: Cabin
- dtype: other
  name: Embarked
- dtype: int
  name: Pclass
- dtype: int
  name: Survived
- dtype: other
  name: Sex
- dtype: int
  name: SibSp
csv_meta:
  delimiter: ','
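
To show how such a type description can drive CSV loading, here is a small sketch using only the standard library. It does not call edapy.load_csv and makes no claim about how that function works internally. The dictionary TYPES mirrors the YAML structure above for three of the columns, and the names CASTERS and load_typed_rows are hypothetical. Mapping empty cells to None is also an assumption.

```python
import csv
import io

# Mirrors the structure of the types.yaml above (only three columns shown).
TYPES = {
    "columns": [
        {"name": "PassengerId", "dtype": "int"},
        {"name": "Age", "dtype": "float"},
        {"name": "Sex", "dtype": "other"},
    ],
    "csv_meta": {"delimiter": ","},
}

# Assumed mapping from the dtype names to Python casters.
CASTERS = {"int": int, "float": float, "other": str}


def load_typed_rows(text, types):
    """Parse CSV text and cast each listed column; empty cells become None."""
    casters = {c["name"]: CASTERS[c["dtype"]] for c in types["columns"]}
    reader = csv.DictReader(
        io.StringIO(text), delimiter=types["csv_meta"]["delimiter"]
    )
    rows = []
    for raw in reader:
        rows.append(
            {
                name: (cast(raw[name]) if raw[name] != "" else None)
                for name, cast in casters.items()
            }
        )
    return rows


sample = "PassengerId,Age,Sex\n1,22,male\n2,,female\n"
rows = load_typed_rows(sample, TYPES)
```

In a real project the dictionary would be read from the types.yaml file with a YAML parser instead of being written inline.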

An example run might look like this:

$ edapy csv predict --types types_titanik.yaml --csv_path train.csv
Number of datapoints: 891
2018-04-16 21:51:56,279 WARNING Column 'Survived' has only 2 different values ([0, 1]). You might want to make it a 'category'
2018-04-16 21:51:56,280 WARNING Column 'Pclass' has only 3 different values ([3, 1, 2]). You might want to make it a 'category'
2018-04-16 21:51:56,281 WARNING Column 'Sex' has only 2 different values (['male', 'female']). You might want to make it a 'category'
2018-04-16 21:51:56,282 WARNING Column 'SibSp' has only 7 different values ([0, 1, 2, 4, 3, 8, 5]). You might want to make it a 'category'
2018-04-16 21:51:56,283 WARNING Column 'Parch' has only 7 different values ([0, 1, 2, 5, 3, 4, 6]). You might want to make it a 'category'
2018-04-16 21:51:56,285 WARNING Column 'Embarked' has only 3 different values (['S', 'C', 'Q']). You might want to make it a 'category'

## Integer Columns
Column name: Non-nan  mean   std   min   25%   50%   75%   max
PassengerId:     891  446.00  257.35     1   224   446   668   891
Survived   :     891  0.38  0.49     0     0     0     1     1
Pclass     :     891  2.31  0.84     1     2     3     3     3
SibSp      :     891  0.52  1.10     0     0     0     1     8
Parch      :     891  0.38  0.81     0     0     0     0     6

## Float Columns
Column name: Non-nan   mean    std    min    25%    50%    75%    max
Age        :     714  29.70  14.53   0.42  20.12  28.00  38.00  80.00
Fare       :     891  32.20  49.69   0.00   7.91  14.45  31.00  512.33

## Other Columns
Column name: Non-nan   unique   top (count)
Name       :     891      891   Goldschmidt, Mr. George B (1)
Sex        :     891        2   male (577)
Ticket     :     891      681   347082 (7)
Cabin      :     204      148   C23 C25 C27 (4)
Embarked   :     889        4   S (644)
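
The warnings and the "Other Columns" table in the run above can be approximated with a short sketch. This is an illustration under stated assumptions, not edapy's code: the function names suggest_category and summarize_other are hypothetical, the threshold max_unique=10 is a guess (the source does not state the real one), and missing values are represented as None and ignored, which may differ from how edapy counts them.

```python
from collections import Counter


def suggest_category(values, max_unique=10):
    """Return the distinct non-missing values if there are few, else None."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    distinct = list(dict.fromkeys(v for v in values if v is not None))
    return distinct if len(distinct) <= max_unique else None


def summarize_other(values):
    """Return (non-missing count, unique count, top value, top count)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0, 0, None, 0
    counts = Counter(present)
    top, top_count = counts.most_common(1)[0]
    return len(present), len(counts), top, top_count


sex = ["male", "female", "female", "male", "male", None]
candidates = suggest_category(sex)
summary = summarize_other(sex)
```

A column for which suggest_category returns a list is a candidate for the 'category' type, in the spirit of the warnings shown in the log.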
