Creating a dataset-loading function with scikit-learn

Posted 2024-04-29 08:40:55


I'm very new to Python, and I'm trying to load a dataset from my computer using scikit-learn. This is what my code looks like:

**whatever.py**

import numpy as np
import csv
from sklearn.datasets.base import Bunch

class Cortex_nuc:
    def cortex_nuclear():
        with open('C:/Users/User/Desktop/Data_Cortex_Nuclear4.csv') as csv_file:
            data_file = csv.reader(csv_file)
            temp = next(data_file)
            n_samples = int(float(temp[0]))
            n_features = int(float(temp[1]))
            data = np.empty((n_samples, n_features))
            target = np.empty((n_samples,), dtype=np.float64)

            for i, sample in enumerate(data_file):
                data[i] = np.asarray(sample[:-1], dtype=np.float64)
                target[i] = np.asarray(sample[-1], dtype=np.float64)

        return Bunch(data=data, target=target)


Then I import it into my project:

from whatever import Cortex_nuc

Then I try to store the result in df:

df = Cortex_nuc.cortex_nuclear()

By the way, the dataset looks like this:

...

This is only part of the dataset; the full file has 77 columns and about 1000 rows.

But I get an error, and I can't figure out why it happens. Here is the error message:

IndexError                                Traceback (most recent call last)
<ipython-input-5-a4935f2c187f> in <module>
----> 1 df = Cortex_nuc.cortex_nuclear()

~\whatever.py in cortex_nuclear()
     20 
     21             for i, sample in enumerate(data_file):
---> 22                 data[i] = np.asarray(sample[:-1], dtype=np.float64)
     23                 target[i] = np.asarray(sample[-1], dtype=np.float64)
     24 

IndexError: index 0 is out of bounds for axis 0 with size 0

Can anyone help me? Thanks


1 answer
Answer #1 · posted 2024-04-29 08:40:55

If you want to build a "sklearn-like" dataset as a Bunch object, you probably want something like this:

import pandas as pd
import numpy as np
from sklearn.utils import Bunch

# Inline CSV so the example is reproducible without a file on disk
from io import StringIO
csv_file = StringIO("""
target,A,B
0,0,0
1,0,1
1,1,0
0,1,1
""")

def load_xor(*, return_X_y=False):
    """Describe your data here."""
    csv_file.seek(0)  # rewind so repeated calls can re-read the buffer
    _data_file = pd.read_csv(csv_file)
    _data = Bunch()

    _data["DESCR"] = load_xor.__doc__
    _data["data"] = _data_file[["A", "B"]].to_numpy(dtype=np.float64)
    _data["target"] = _data_file["target"].to_numpy(dtype=np.float64)
    _data["target_names"] = np.array(["false", "true"])
    _data["feature_names"] = np.array(list(_data_file.drop(["target"], axis=1)))

    if return_X_y:
        return _data.data, _data.target
    return _data

if __name__ == "__main__":
    # Return and unpack the `X`, `y` tuple
    X, y = load_xor(return_X_y=True)
    print(X, y)
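
The function above writes entries with key syntax (`_data["data"]`) and reads them back with attribute syntax (`_data.data`). A minimal sketch of why both work, using hypothetical toy values that are not part of the original answer:

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical toy values, only to show the two access styles.
toy = Bunch(data=np.zeros((4, 2)), target=np.array([0.0, 1.0, 1.0, 0.0]))

# Attribute access and key access return the very same object.
same_object = toy.data is toy["data"]

# A key added after construction is also reachable as an attribute.
toy["feature_names"] = ["A", "B"]
names_via_attr = toy.feature_names

print(same_object, names_via_attr, sorted(toy.keys()))
```

Bunch is a dict subclass, so anything that works on a dict (iteration, `keys()`, `in`) also works on the returned dataset.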

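The keys used in `load_xor` follow the convention of the loaders in sklearn.datasets, which typically return a Bunch object with specific attributes/keys (see the "Returns" section of a loader's documentation, for example that of `load_iris`):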

>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> dir(data)
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
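
As for the IndexError in the question: the message says axis 0 has size 0, so the `n_samples` value parsed from the first row was 0 and `np.empty((n_samples, n_features))` had no rows to assign into. That first-row convention (sample count, then feature count) only holds for files written that way. The sketch below reproduces the error and then sizes the arrays from the rows themselves. It assumes a header row, all-numeric columns, and the target in the last column; the inline CSV is a hypothetical stand-in for the real file.

```python
import csv
from io import StringIO

import numpy as np

# Reproduce the failure: an array with zero rows cannot be indexed at row 0.
empty = np.empty((0, 2))
try:
    empty[0] = [1.0, 2.0]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# Hypothetical stand-in for the real CSV: header row, numeric columns, target last.
raw = StringIO("A,B,target\n0,0,0\n0,1,1\n1,0,1\n1,1,0\n")

reader = csv.reader(raw)
header = next(reader)                  # column names, not (n_samples, n_features)
rows = [row for row in reader if row]  # skip empty lines

# Let the data determine the shape instead of trusting the first row.
values = np.asarray(rows, dtype=np.float64)
data, target = values[:, :-1], values[:, -1]

print(error_message)
print(header, data.shape, target.shape)
```

If the real file contains non-numeric columns, the float conversion will raise a ValueError, and reading it with `pd.read_csv` as in `load_xor` is likely the easier route.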
