Java iText 7: how to filter render events during writing

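I would like to filter render text events as they are written to the output file. I have a PDF containing some text that I want to filter out. I found that I can walk through the document once and determine the characteristics of the render events to filter. Now I want to copy the pages of the source document while skipping some RENDER_TEXT events, so that the text does not appear in the destination document. I have an IEventFilter that accepts the right events. I just need to know how to attach this filter to the document writer.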

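The goal is to take a PDF created from Google Calendar in agenda format and remove the "Created by:" and "Calendar:" lines. Each of these lines usually consists of 3 render text events.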

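My current code is below. I found that the events I want to remove can be identified as all RENDER_TEXT events whose baselines share the same y coordinate.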

import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.itextpdf.kernel.geom.LineSegment;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.IEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;

public class Main {

    private static final Logger LOGGER = LogManager.getLogger();

    public static void main(String[] args) throws FileNotFoundException, IOException {
        final Path src = Paths.get("calendar_2018-08-04_2018-08-19.pdf");
        final Path dest = Paths.get("/home/jpschewe/Downloads/calendar_clean.pdf");

        final Main app = new Main(src, dest);

    }

    private Main(final Path src, final Path dest) throws FileNotFoundException, IOException {

        try (PdfDocument srcDoc = new PdfDocument(new PdfReader(src.toFile()));
                PdfDocument destDoc = new PdfDocument(new PdfWriter(dest.toFile()))) {
            final Rectangle pageSize = srcDoc.getFirstPage().getPageSize();

            for (int i = 1; i <= srcDoc.getNumberOfPages(); ++i) {
                PdfPage page = srcDoc.getPage(i);

                final GatherBaselines gatherBaselines = new GatherBaselines();
                final PdfCanvasProcessor processor = new PdfCanvasProcessor(gatherBaselines);
                processor.processPageContent(page);

                LOGGER.info("Filter baselines for page {} -> {}", i, gatherBaselines.baselinesToFilter);

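                // NOTE: this only adds a blank page; the filtered page content
                // is not copied yet, which is what the question asks about
                destDoc.setDefaultPageSize(new PageSize(pageSize));
                destDoc.addNewPage();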
            }

        }
    }

    public class FilterEventsByBaseline implements IEventFilter {
        private final List<Float> baselinesToFilter;

        public FilterEventsByBaseline(final List<Float> baselinesToFilter) {
            this.baselinesToFilter = baselinesToFilter;
        }

        @Override
        public boolean accept(final IEventData data, final EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                final TextRenderInfo renderInfo = (TextRenderInfo) data;
                final LineSegment baseline = renderInfo.getBaseline();
                final float checkY = baseline.getStartPoint().get(1);

                final boolean filter = baselinesToFilter.stream().anyMatch(f -> Math.abs(checkY - f) < 1E-6);
                return !filter;
            }

            return true;

        }
    }

    public class GatherBaselines implements IEventListener {

        // need to store all baselines that are problems
        // the assumption is that all RENDER_TEXT operations with a baseline in the bad
        // list need to be filtered when copying pages
        private final List<Float> baselinesToFilter = new LinkedList<>();

        @Override
        public void eventOccurred(final IEventData data, final EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                final TextRenderInfo renderInfo = (TextRenderInfo) data;

                final String text = renderInfo.getText();
                final LineSegment baseline = renderInfo.getBaseline();
                if (null != text && (text.contains("Calendar:") || text.contains("Created by:"))) {
                    // index 1 is the y coordinate
                    baselinesToFilter.add(baseline.getStartPoint().get(1));
                }
            }

        }

        @Override
        public Set<EventType> getSupportedEvents() {
            return Collections.singleton(EventType.RENDER_TEXT);
        }

    }

}

Thanks, everyone.


1 answer

  1. # Answer 1

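    As suggested in the comments, you can use the PdfCanvasEditor from this answer to filter the desired operations out of the content stream. In fact, I extended that class slightly so that it properly supports the ' and " text-showing operators. You can find that class here.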

    Just as in your approach, the lines to clear are determined in a first pass; I use a RegexBasedLocationExtractionStrategy instance for that.
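
    The idea of that first pass can be sketched without iText, assuming the text chunks of a page are already available together with their vertical extent. `TextChunk` and `findTriggerBands` are hypothetical names used only for illustration and are not part of iText; in the actual answer, RegexBasedLocationExtractionStrategy does this work on the real content stream.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;

    final class TriggerBandSketch {

        /** Hypothetical stand-in for a located piece of text and its vertical extent. */
        static final class TextChunk {
            final String text;
            final float bottom;
            final float top;

            TextChunk(String text, float bottom, float top) {
                this.text = text;
                this.bottom = bottom;
                this.top = top;
            }
        }

        /** Returns one {bottom, top} pair for every chunk whose text matches the pattern. */
        static List<float[]> findTriggerBands(List<TextChunk> chunks, Pattern pattern) {
            List<float[]> bands = new ArrayList<>();
            for (TextChunk chunk : chunks) {
                if (pattern.matcher(chunk.text).find()) {
                    bands.add(new float[] {chunk.bottom, chunk.top});
                }
            }
            return bands;
        }

        public static void main(String[] args) {
            // Made-up sample data; real coordinates would come from the PDF
            Pattern trigger = Pattern.compile("Created by:|Calendar:");
            List<TextChunk> chunks = Arrays.asList(
                    new TextChunk("Team meeting", 700f, 712f),
                    new TextChunk("Calendar: Work", 680f, 692f),
                    new TextChunk("Created by: someone", 660f, 672f));
            for (float[] band : findTriggerBands(chunks, trigger)) {
                System.out.println(band[0] + " .. " + band[1]);
            }
        }
    }
    ```

    The resulting bands play the role of `triggerRectangles` in the second pass: any text whose origin falls inside one of them is a candidate for removal.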

    Afterwards, in the PdfCanvasEditor step, the instructions that draw text on those lines are changed to draw only an empty string.
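
    A minimal, library-free sketch of that substitution follows. It models an instruction as a plain `List<Object>` whose last element is the operator name, which is an assumption chosen to mirror the `operands.size() - 2` index used in the iText code of this answer; `blankText` is a hypothetical helper, not an iText method.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class BlankTextOperandsSketch {

        static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

        /**
         * Blanks the text operand in place. The list is assumed to hold the
         * operands followed by the operator itself as its last element.
         */
        static void blankText(List<Object> operands) {
            String operator = (String) operands.get(operands.size() - 1);
            if (!TEXT_SHOWING_OPERATORS.contains(operator)) {
                return; // not a text-showing instruction, leave it untouched
            }
            if ("TJ".equals(operator)) {
                // TJ takes one array of strings and kerning numbers: replace it with an empty one
                operands.set(0, new ArrayList<Object>());
            } else {
                // Tj, ' and " all carry the string as the last operand before the operator
                operands.set(operands.size() - 2, "");
            }
        }

        public static void main(String[] args) {
            // The " operator takes word spacing, character spacing and the string
            List<Object> quote = new ArrayList<>(Arrays.asList(1.0f, 0.5f, "Calendar: Work", "\""));
            blankText(quote);
            System.out.println(quote);
        }
    }
    ```

    Keeping the operator and only emptying its string means that side effects such as the line move done by ' and " still happen, so the layout of the remaining text is not disturbed.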

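    The exact mechanism is not derived from an IEventFilter, though: what is inspected and rewritten here is not the events you examine but the more basic operator-and-operand structures, because those are what actually cause the text to be drawn. The mechanism is nevertheless similar to your approach.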

    try (PdfDocument pdfDocument = new PdfDocument(SOURCE_PDF_READER, TARGET_PDF_WRITER)) {
        List<Rectangle> triggerRectangles = new ArrayList<>();
    
        PdfCanvasEditor editor = new PdfCanvasEditor()
        {
            {
                // textMatrix is not accessible from here, so it is read via reflection
                Field field = PdfCanvasProcessor.class.getDeclaredField("textMatrix");
                field.setAccessible(true);
                textMatrixField = field;
            }
    
            @Override
            protected void nextOperation(PdfLiteral operator, List<PdfObject> operands) {
                try {
                    recentTextMatrix = (Matrix)textMatrixField.get(this);
                } catch (IllegalArgumentException | IllegalAccessException e) {
                    throw new RuntimeException(e);
                }
            }
    
            @Override
            protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
            {
                String operatorString = operator.toString();
    
                if (TEXT_SHOWING_OPERATORS.contains(operatorString))
                {
                    Matrix matrix = null;
                    try {
                        matrix = recentTextMatrix.multiply(getGraphicsState().getCtm());
                    } catch (IllegalArgumentException e) {
                        throw new RuntimeException(e);
                    }
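                    // I32 is the vertical translation, i.e. the y coordinate of the text origin
                    float y = matrix.get(Matrix.I32);
                    if (triggerRectangles.stream().anyMatch(rect -> rect.getBottom() <= y && y <= rect.getTop())) {
                        // text on a trigger line: keep the operator but show nothing
                        if ("TJ".equals(operatorString))
                            operands.set(0, new PdfArray());
                        else
                            // Tj, ' and ": the string is the last operand before the operator
                            operands.set(operands.size() - 2, new PdfString(""));
                    }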
                }
    
                super.write(processor, operator, operands);
            }
    
            final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
            final Field textMatrixField;
            Matrix recentTextMatrix;
        };
    
        for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
        {
            PdfPage page = pdfDocument.getPage(i);
            Set<PdfName> xobjectNames = page.getResources().getResourceNames(PdfName.XObject);
            for (PdfName xobjectName : xobjectNames) {
                PdfFormXObject xobject = page.getResources().getForm(xobjectName);
                byte[] content = xobject.getPdfObject().getBytes();
                PdfResources resources = xobject.getResources();
    
                // first pass: locate the lines to clear in this form XObject
                RegexBasedLocationExtractionStrategy regexLocator = new RegexBasedLocationExtractionStrategy("Created by:|Calendar:");
                new PdfCanvasProcessor(regexLocator).processContent(content, resources);
                triggerRectangles.clear();
                triggerRectangles.addAll(regexLocator.getResultantLocations().stream().map(loc -> loc.getRectangle()).collect(Collectors.toSet()));
    
                // second pass: rewrite the content stream, blanking the text on those lines
                PdfCanvas pdfCanvas = new PdfCanvas(new PdfStream(), resources, pdfDocument);
                editor.editContent(content, resources, pdfCanvas);
                xobject.getPdfObject().setData(pdfCanvas.getContentStream().getBytes());
            }
        }
    }
    

    (EditPageContent test testRemoveSpecificLinesCalendar)


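    Beware, this is only a proof of concept tailored specifically to the OP's use case: the PdfCanvasEditor is used here only to inspect and edit the first-level form XObjects of each page, because PDFs created from Google Calendar in agenda format hold all page content in form XObjects which in turn are drawn in the page content stream. Furthermore, the text is assumed to run parallel to the top of the page.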
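
    To illustrate why that last assumption matters, here is a small sketch of the y test in the code above, written with plain arrays instead of iText's Matrix. It assumes the row-vector convention in which the translation sits in the third row, so that row 3, column 2 corresponds to `Matrix.I32`; the helper names are hypothetical.

    ```java
    final class TextOriginSketch {

        /** Multiplies two 3x3 matrices stored row-major in float[9] (a times b). */
        static float[] multiply(float[] a, float[] b) {
            float[] result = new float[9];
            for (int row = 0; row < 3; row++) {
                for (int col = 0; col < 3; col++) {
                    float sum = 0f;
                    for (int k = 0; k < 3; k++) {
                        sum += a[row * 3 + k] * b[k * 3 + col];
                    }
                    result[row * 3 + col] = sum;
                }
            }
            return result;
        }

        /** y coordinate of the text origin: row 3, column 2 of textMatrix times ctm. */
        static float originY(float[] textMatrix, float[] ctm) {
            return multiply(textMatrix, ctm)[7];
        }

        /** Same band test as applied to the trigger rectangles. */
        static boolean inBand(float y, float bottom, float top) {
            return bottom <= y && y <= top;
        }

        public static void main(String[] args) {
            // Made-up values: text positioned at (50, 100) inside an XObject drawn 600 units up
            float[] textMatrix = {1, 0, 0, 0, 1, 0, 50, 100, 1};
            float[] ctm = {1, 0, 0, 0, 1, 0, 0, 600, 1};
            float y = originY(textMatrix, ctm);
            System.out.println(y + " in band: " + inBand(y, 690f, 712f));
        }
    }
    ```

    Only the origin of each text-showing instruction is tested, and only its y coordinate. For horizontal text every instruction on a line shares that value; for rotated text the y coordinate changes along the line, so a single band test would no longer identify it.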