ResourceExhaustedError: OOM when allocating tensor

Posted 2024-04-25 22:54:24


I made my own AlexNet implementation with a reduced fully connected layer to classify 102 classes of flowers. My training set contains 11,000 images, while the validation and test sets each have 3,000 images. I wrote the three datasets to disk in HDF5 format, reloaded them, and tried to pass the images through the network with a batch size of 8 for 75 epochs. However, I get a memory error.

I have already tried reducing the batch size to 8 and the image dimensions to 400x400 (originally 500x500), but it did not help.

import pandas as pd
import config
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from AlexNet import AlexNet
from preproce import ImageToArrayPreprocessor
from preproce import AspectAwarePreprocessor
from preproce import FCHeadNet
from preproce import HDF5DatasetGenerator
from preproce import HDF5DatasetWriter
from tensorflow.python.keras.preprocessing.image import ImageDataGenerator
from tensorflow.python.keras.optimizers import RMSprop
from tensorflow.python.keras.optimizers import SGD
from tensorflow.python.keras.applications import VGG16
from tensorflow.python.keras.layers import Input
from tensorflow.python.keras.models import Model
from imutils import paths
import numpy as np
import argparse
import cv2
import os



"""aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                          height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                          horizontal_flip=True, fill_mode="nearest")"""
"""print("[INFO] loading images...")
trainPaths = list(paths.list_images(config.IMAGES_PATH))
dataset = pd.read_csv("train.csv")
labels = dataset.iloc[:, 1].values
le = LabelEncoder()
trainLabels = le.fit_transform(labels)

split = train_test_split(trainPaths, trainLabels,
                          test_size=config.NUM_TEST_IMAGES, stratify=trainLabels,
                          random_state=42)
(trainPaths, testPaths, trainLabels, testLabels) = split 

split = train_test_split(trainPaths, trainLabels,
                         test_size=config.NUM_VAL_IMAGES, stratify=trainLabels,random_state=42)
(trainPaths, valPaths, trainLabels, valLabels) = split

datasets = [ ("train", trainPaths, trainLabels, config.TRAIN_HDF5),
             ("val", valPaths, valLabels, config.VAL_HDF5),
             ("test", testPaths, testLabels, config.TEST_HDF5)]



for (dType, paths, labels, outputPath) in datasets: 
    print("[INFO] building {}...".format(outputPath))
    writer = HDF5DatasetWriter((len(paths), 500, 500, 3), outputPath) 
    for (i, (path, label)) in enumerate(zip(paths, labels)): 
        image = cv2.imread(path) 
        image = aap.preprocess(image) 
        writer.add([image], [label])
    writer.close()"""
#aap = AspectAwarePreprocessor(500, 500)
iap = ImageToArrayPreprocessor()
# Stream batches of 8 images straight from the HDF5 datasets instead of
# loading everything into memory at once.
trainGen = HDF5DatasetGenerator(config.TRAIN_HDF5, 8, preprocessors=[iap], classes=102)
valGen = HDF5DatasetGenerator(config.VAL_HDF5, 8, preprocessors=[iap], classes=102)




print("[INFO] compiling model...")
opt = RMSprop(lr=0.001)
model=AlexNet.build(500,500,3,102)
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"]) 
print("[INFO] training head...")

# 75 epochs at batch size 8, reading batches from the HDF5 generators
model.fit_generator(
         trainGen.generator(),
         steps_per_epoch=trainGen.numImages // 8,
         validation_data=valGen.generator(),
         validation_steps=valGen.numImages // 8,
         epochs=75,
         max_queue_size=8 * 2, verbose=1)
print("[INFO] serializing model...")
model.save(config.MODEL_PATH, overwrite=True) 
trainGen.close()
valGen.close()

tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-08-23 00:19:47.336560: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:01:00.0 totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-08-23 00:19:47.342432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-08-23 00:19:47.900540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-23 00:19:47.904687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-08-23 00:19:47.907033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-08-23 00:19:47.909380: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3007 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-08-23 00:19:48.550001: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:49.089904: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:49.629533: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:50.067994: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
2019-08-23 00:19:50.523258: W tensorflow/core/framework/allocator.cc:124] Allocation of 822083584 exceeds 10% of system memory.
Epoch 1/75
2019-08-23 00:20:14.632764: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2019-08-23 00:20:16.325917: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.410374: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 836.38MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.650565: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.716695: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.733003: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 637.52MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.782250: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 844.88MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:16.792756: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 429.27MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-08-23 00:20:25.135977: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 784.00MiB. Current allocation summary follows.
2019-08-23 00:20:25.143913: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 104, Chunks in use: 99. 26.0KiB allocated for chunks. 24.8KiB in use in bin. 452B client-requested in use in bin.
2019-08-23 00:20:25.150353: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 16, Chunks in use: 14. 8.0KiB allocated for chunks. 7.0KiB in use in bin. 5.3KiB client-requested in use in bin.
2019-08-23 00:20:25.160812: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1024): Total Chunks: 49, Chunks in use: 49. 61.3KiB allocated for chunks. 61.3KiB in use in bin. 60.1KiB client-requested in use in bin.
2019-08-23 00:20:25.169944: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2048): Total Chunks: 4, Chunks in use: 4. 13.0KiB allocated for chunks. 13.0KiB in use in bin. 12.8KiB client-requested in use in bin.
2019-08-23 00:20:25.182025: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4096): Total Chunks: 1, Chunks in use: 0. 6.3KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.192454: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8192): Total Chunks: 1, Chunks in use: 0. 15.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.200847: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16384): Total Chunks: 9, Chunks in use: 9. 144.8KiB allocated for chunks. 144.8KiB in use in bin. 144.0KiB client-requested in use in bin.
2019-08-23 00:20:25.209817: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.219192: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.228194: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (131072): Total Chunks: 9, Chunks in use: 9. 1.17MiB allocated for chunks. 1.17MiB in use in bin. 1.16MiB client-requested in use in bin.
2019-08-23 00:20:25.236088: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.245435: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.254114: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (1048576): Total Chunks: 8, Chunks in use: 7. 12.25MiB allocated for chunks. 11.22MiB in use in bin. 10.91MiB client-requested in use in bin.
2019-08-23 00:20:25.264209: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (2097152): Total Chunks: 14, Chunks in use: 14. 42.09MiB allocated for chunks. 42.09MiB in use in bin. 42.09MiB client-requested in use in bin.
2019-08-23 00:20:25.273799: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (4194304): Total Chunks: 13, Chunks in use: 13. 80.41MiB allocated for chunks. 80.41MiB in use in bin. 77.91MiB client-requested in use in bin.
2019-08-23 00:20:25.285089: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (8388608): Total Chunks: 13, Chunks in use: 13. 141.14MiB allocated for chunks. 141.14MiB in use in bin. 136.45MiB client-requested in use in bin.
2019-08-23 00:20:25.298520: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (16777216): Total Chunks: 4, Chunks in use: 4. 112.98MiB allocated for chunks. 112.98MiB in use in bin. 112.98MiB client-requested in use in bin.
2019-08-23 00:20:25.306979: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (33554432): Total Chunks: 4, Chunks in use: 4. 183.11MiB allocated for chunks. 183.11MiB in use in bin. 183.11MiB client-requested in use in bin.
2019-08-23 00:20:25.315121: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (67108864): Total Chunks: 1, Chunks in use: 0. 82.18MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.322194: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-23 00:20:25.331550: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456): Total Chunks: 3, Chunks in use: 3. 2.30GiB allocated for chunks. 2.30GiB in use in bin. 2.30GiB client-requested in use in bin.
2019-08-23 00:20:25.342419: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 784.00MiB was 256.00MiB, Chunk State:
tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 2.87GiB
2019-08-23 00:20:50.049508: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:        3153697177
InUse:        3086482944
MaxInUse:     3153574400
NumAllocs:           388
MaxAllocSize:  822083584

2019-08-23 00:20:50.061236: W tensorflow/core/common_runtime/bfc_allocator.cc:271] **************************************************************************************************__
2019-08-23 00:20:50.066546: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[50176,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 80, in <module>
    max_queue_size=8 * 2, verbose=1)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1426, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training_generator.py", line 191, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1191, in train_on_batch
    outputs = self._fit_function(ins)  # pylint: disable=not-callable
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\Users\aleem\Anaconda3\envs\tensorflowf\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[50176,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/RMSprop/gradients/loss/kernel/Regularizer_5/Square_grad/Mul_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node ConstantFoldingCtrl/loss/activation_6_loss/broadcast_weights/assert_broadcastable/AssertGuard/Switch_0}}]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


1 Answer

Answered by a forum user, 2024-04-25 22:54:24

This happens because GPU memory cannot be allocated freely for training, which can be caused by loading the whole dataset into memory (if it is not processed in batches). But you are already using fit_generator, so we can rule that out, since it feeds the data to training in batches while generating those batches in parallel.
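For reference, here is a minimal sketch of what such a batch generator does, assuming the HDF5 files contain "images" and "labels" datasets with integer class labels (as the HDF5DatasetWriter in the question appears to create). This is not the HDF5DatasetGenerator class used above, only an illustration of why just one batch at a time has to fit in memory:

import h5py
import numpy as np

def hdf5_batch_generator(db_path, batch_size=8, num_classes=102):
    # h5py reads slices lazily, so only the current batch is ever held in RAM.
    db = h5py.File(db_path, "r")
    num_images = db["images"].shape[0]
    while True:  # Keras generators are expected to loop forever
        for i in range(0, num_images, batch_size):
            images = db["images"][i:i + batch_size].astype("float32")
            labels = db["labels"][i:i + batch_size].astype("int")
            # one-hot encode the integer labels for categorical_crossentropy
            yield images, np.eye(num_classes, dtype="float32")[labels]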

The solution is to check which process is occupying your GPU. If you have an NVIDIA GPU you can see the processes using it with nvidia-smi, or you can try ps -fA | grep python. That shows which processes are running and holding the GPU. Just take the process ID from the PID column and terminate the process with kill -9 PID, then re-run the training; this time your GPU is free. I ran into the same problem, and clearing the GPU helped me (a rough Python equivalent of these commands is sketched after the note below).

  • Note: all of these commands are meant to be run in a terminal.
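A rough Python equivalent of the commands above, assuming a Unix-like system with the NVIDIA driver installed (nvidia-smi on the PATH); the PID shown is only a placeholder:

import os
import signal
import subprocess

# List the processes currently holding GPU memory (the same listing you get
# from running nvidia-smi in a terminal).
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# Placeholder PID: replace it with the one reported by nvidia-smi, then
# uncomment the last line to terminate that process (equivalent to kill -9 PID).
stale_pid = 12345
# os.kill(stale_pid, signal.SIGKILL)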
