In Django, how do I convert an uploaded PDF file to an image and save it to the corresponding column in the database?

Posted 2024-04-19 10:13:09


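I'm building an HTML template that displays the cover of a PDF file (the first page, or a page the user chooses). I'd like Django to create the cover image automatically, without a separate upload.

The PDF file is uploaded through a Django ModelForm. Here is the structure of my code: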

models.py

class Pdffile(models.Model):
    pdf = models.FileField(upload_to='pdfdirectory/')
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to='coverdirectory/')

form.py

class PdffileForm(ModelForm):
    class Meta:
        model = Pdffile
        fields = (
            'pdf',
            'filename',
            'pagenumforcover',
        )

views.py

def upload(request):
    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')
    else:
        form = PdffileForm()
    return render(request, "uploadform.html", {'form': form})


def pdfcover(request, pk):
    thispdf = get_object_or_404(Pdffile, pk=pk)

    return render(request, 'pdfcover.html', {'thispdf': thispdf})

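In pdfcover.html I want to use the Django template language so that a different page is rendered for each uploaded PDF. That's why I want to save the image file in the same database record as the PDF file.

I'm new to Python, new to Django, and obviously new to Stack Overflow. I've tried pdf2image and PyPDF2, and I believe either could work, but I just can't find the right code. I'd appreciate it if you could point me in the right direction.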


1 answer

The pdf2image package has a function called convert_from_path.

Here is the package's own description of each of the function's parameters:

Parameters:
            pdf_path -> Path to the PDF that you want to convert
            dpi -> Image quality in DPI (default 200)
            output_folder -> Write the resulting images to a folder (instead of directly in memory)
            first_page -> First page to process
            last_page -> Last page to process before stopping
            fmt -> Output image format
            jpegopt -> jpeg options `quality`, `progressive`, and `optimize` (only for jpeg format)
            thread_count -> How many threads we are allowed to spawn for processing
            userpw -> PDF's password
            use_cropbox -> Use cropbox instead of mediabox
            strict -> When a Syntax Error is thrown, it will be raised as an Exception
            transparent -> Output with a transparent background instead of a white one.
            single_file -> Uses the -singlefile option from pdftoppm/pdftocairo
            output_file -> What is the output filename or generator
            poppler_path -> Path to look for poppler binaries
            grayscale -> Output grayscale image(s)
            size -> Size of the resulting image(s), uses the Pillow (width, height) standard
            paths_only -> Don't load image(s), return paths instead (requires output_folder)
            use_pdftocairo -> Use pdftocairo instead of pdftoppm, may help performance
            timeout -> Raise PDFPopplerTimeoutError after the given time

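Because convert_from_path is designed to convert every page of a PDF to an image, it returns a list of image objects.

If you set the output_folder parameter, each image is saved to that location. In this case output_folder must be a full path starting from the base directory, for example 'path/from/root/to/output_folder'. If you don't set it, the converted images are not saved to disk and exist only in memory.

By default, if you don't set the output_file parameter, the function generates a random file name such as 0a15a918-59ba-4f15-90f0-2ed5fbd0c36c-1.ext. If you do set it, the same name is used for every converted page, so with output_file set to 'file_name' the files are named starting from 'file_name0001-1.ext'.

Note that if you set both output_file and output_folder and then convert two different PDFs, the second PDF's image files overwrite the first's (if they are in the same directory).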
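The behavior described above can be sketched as follows. This is a minimal illustration rather than part of the answer's code: single_page_kwargs and render_page are hypothetical helper names, and pdf2image is imported lazily so the module still loads where pdf2image or poppler is not installed.

```python
def single_page_kwargs(page, dpi=200, fmt="jpeg"):
    """Build keyword arguments that limit convert_from_path to one page.

    Hypothetical helper; page numbers are assumed to start at 1.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"first_page": page, "last_page": page, "dpi": dpi, "fmt": fmt}


def render_page(pdf_path, page, output_folder=None):
    """Return one page of the PDF as a Pillow image.

    With output_folder=None the image exists only in memory; otherwise
    pdf2image also writes it into that folder under a generated name.
    """
    # Imported here so the helper above works without pdf2image/poppler.
    from pdf2image import convert_from_path

    kwargs = single_page_kwargs(page)
    if output_folder is not None:
        kwargs["output_folder"] = output_folder
    # convert_from_path returns a list, even for a single page.
    return convert_from_path(pdf_path, **kwargs)[0]
```

Keeping the in-memory variant as the default avoids the overwrite problem entirely, since nothing is written until you choose a file name yourself.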



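Below is some code modeled around the code in your question. It assumes you have pdf2image installed (which in turn needs the poppler binaries).

I added a built-in validator to the pdf file field, because the code would crash if anything other than a PDF were uploaded: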

validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
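FileExtensionValidator only inspects the file name, so a renamed non-PDF file still passes it. A minimal sketch of an additional content check is shown below. looks_like_pdf is a hypothetical helper (it is not part of Django or of the original answer), and it assumes a genuine PDF carries the %PDF- marker within its first 1024 bytes.

```python
import io

PDF_MARKER = b"%PDF-"


def looks_like_pdf(fileobj, probe_size=1024):
    """Return True if the stream has a PDF header near its start.

    The stream position is restored so the caller can still save the file.
    """
    start = fileobj.tell()
    try:
        fileobj.seek(0)
        head = fileobj.read(probe_size)
    finally:
        fileobj.seek(start)
    return PDF_MARKER in head


# Usage with in-memory streams standing in for an uploaded file.
print(looks_like_pdf(io.BytesIO(b"%PDF-1.7\n...")))  # True
print(looks_like_pdf(io.BytesIO(b"GIF89a...")))      # False
```

In a Django form this could be called from a clean_pdf method on PdffileForm, raising ValidationError when it returns False.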

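I also created three constants for the upload directories and the file format. If you need to change any of them, the rest of the code can stay the same.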

COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'

I'm also assuming you have the default settings for saving files:

settings.py

MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')

models.py

from django.core.validators import FileExtensionValidator
from django.db import models
from django.db.models.signals import post_save
from pdf2image import convert_from_path
from django.conf import settings
import os


COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'

# this function is used to rename the pdf to the name specified by filename field
def set_pdf_file_name(instance, filename):
    return os.path.join(PDF_DIRECTORY, '{}.pdf'.format(instance.filename))

# upload_to for coverpage; never called in this example because coverpage is set programmatically
def set_cover_file_name(instance, filename):
    return os.path.join(COVER_PAGE_DIRECTORY, '{}.{}'.format(instance.filename, COVER_PAGE_FORMAT))

class Pdffile(models.Model):
    # validator checks file is pdf when form submitted
    pdf = models.FileField(
        upload_to=set_pdf_file_name, 
        validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
        )
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to=set_cover_file_name)

def convert_pdf_to_image(sender, instance, created, **kwargs):
    if created:
        # check if COVER_PAGE_DIRECTORY exists, create it if it doesn't
        # have to do this because of setting coverpage attribute of instance programmatically
        cover_page_dir = os.path.join(settings.MEDIA_ROOT, COVER_PAGE_DIRECTORY)

        # makedirs also creates missing parents and tolerates an existing directory
        os.makedirs(cover_page_dir, exist_ok=True)

        # convert page cover (in this case) to jpg and save
        cover_page_image = convert_from_path(
            pdf_path=instance.pdf.path,
            dpi=200, 
            first_page=instance.pagenumforcover, 
            last_page=instance.pagenumforcover, 
            fmt=COVER_PAGE_FORMAT, 
            output_folder=cover_page_dir,
            )[0]

        # get name of pdf_file 
        pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
        new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
        # rename the file that was saved to be the same as the pdf file
        os.rename(cover_page_image.filename, new_cover_page_path)
        # get the relative path to the cover page to store in model
        new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
        instance.coverpage = new_cover_page_path_relative

        # call save on the model instance to update database record
        instance.save()

post_save.connect(convert_pdf_to_image, sender=Pdffile)

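convert_pdf_to_image is a function that runs on the Pdffile model's post_save signal. It runs after your PdffileForm is saved in the upload view, so that we can create the cover image file from the saved PDF file.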

cover_page_image = convert_from_path(
            pdf_path=instance.pdf.path,
            dpi=200, 
            first_page=instance.pagenumforcover, 
            last_page=instance.pagenumforcover, 
            fmt=COVER_PAGE_FORMAT, 
            output_folder=cover_page_dir,
            )[0]

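Changing dpi changes the quality of the image. To convert only a single page, first_page and last_page are set to the same value. Because the result is a list, we take its first and only element and assign it to cover_page_image.

A small change to the upload view: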
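Since pagenumforcover comes from the user, it may point past the end of the document, in which case the conversion may return an empty list and the [0] lookup would fail. A hedged sketch of a guard follows. clamp_page and safe_cover_page are hypothetical names; pdfinfo_from_path is pdf2image's wrapper around poppler's pdfinfo, and the sketch assumes its result contains a "Pages" entry.

```python
def clamp_page(requested, total_pages):
    """Map a user-supplied page number into the range 1..total_pages."""
    if total_pages < 1:
        raise ValueError("the PDF has no pages")
    return max(1, min(requested, total_pages))


def safe_cover_page(pdf_path, requested):
    """Return a page number that exists in the PDF at pdf_path."""
    # Lazy import: needs pdf2image and the poppler binaries at call time.
    from pdf2image import pdfinfo_from_path

    total_pages = int(pdfinfo_from_path(pdf_path)["Pages"])
    return clamp_page(requested, total_pages)
```

The signal handler could then pass safe_cover_page(instance.pdf.path, instance.pagenumforcover) as both first_page and last_page. Whether to clamp silently or reject the form is a design choice; rejecting in form validation gives the user clearer feedback.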

views.py

def upload(request):

    form = PdffileForm()

    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        # if form is not valid then form data will be sent back to view to show error message
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')

    return render(request, "uploadform.html", {'form': form})

I don't know what your uploadform.html looks like, but I used the following:

uploadform.html

<h1>Upload PDF</h1>

<form method="POST" enctype="multipart/form-data">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Upload</button>
</form>

[Screenshots: the example PDF; the upload through the form; the resulting database record; the file locations generated after upload]



Final note:

Because FileField has built-in logic that ensures an existing file is never overwritten, the following code

# get name of pdf_file
pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
# rename file to be the same as the pdf file
os.rename(cover_page_image.filename, new_cover_page_path)
# get the relative path to the cover page to store in model
new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
instance.coverpage = new_cover_page_path_relative

names the cover image after the file name stored in the pdf field, which is almost certainly unique.
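The naming step can be isolated into a small pure function. This is a sketch rather than the answer's code: cover_relpath_for is a hypothetical helper, and the suffix in the second example file name only imitates the kind of suffix Django's storage appends when a name collides.

```python
import posixpath

COVER_PAGE_DIRECTORY = "coverdirectory/"
COVER_PAGE_FORMAT = "jpg"


def cover_relpath_for(pdf_name, cover_dir=COVER_PAGE_DIRECTORY, fmt=COVER_PAGE_FORMAT):
    """Derive the cover's MEDIA_ROOT-relative path from the stored PDF name."""
    stem, _extension = posixpath.splitext(posixpath.basename(pdf_name))
    return posixpath.join(cover_dir, "{}.{}".format(stem, fmt))


print(cover_relpath_for("pdfdirectory/report.pdf"))
# coverdirectory/report.jpg
print(cover_relpath_for("pdfdirectory/report_x7Kd2aB.pdf"))
# coverdirectory/report_x7Kd2aB.jpg
```

Because the cover name is derived from the name the PDF actually received on disk, two uploads with the same filename value still end up with distinct cover files.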
