Processing PDFs in Python 3: Getting Started with PyMuPDF


Introduction to PyMuPDF

PyMuPDF is a Python library for working with PDF files. It provides a rich set of functions for manipulating, analyzing, and converting PDF documents, and it aims to offer a simple API so that these operations are easy to implement in a Python program.

Its main features are:

Cross-platform: PyMuPDF runs on Windows, macOS, and Linux.
Extensive PDF handling: reading, writing, splitting, merging, rotating, and cropping pages; encrypting and decrypting documents; and extracting text, images, and metadata.
Ease of use: the API is concise, so most operations take only a few function calls and need no knowledge of the underlying details.

Installing PyMuPDF and its third-party dependencies

Install the PyMuPDF module with pip

pip install pymupdf

Verify that the pymupdf module was installed successfully

import fitz

# Print basic information about the pymupdf module
print(fitz.__doc__)

PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.10 on win32 (64-bit).

Third-party dependencies of PyMuPDF

Pixmap.pil_save() and Pixmap.pil_tobytes() require the Pillow module.

Document.subset_fonts() requires the fontTools module.
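Since both packages are optional, it can help to check for them before calling the methods that need them. The following is a minimal sketch using only the standard library; the helper name optional_dependency_status and the OPTIONAL_PACKAGES mapping are made up for this illustration and are not part of PyMuPDF.

```python
from importlib.util import find_spec

# Display name -> import name of the optional packages mentioned above.
# (This mapping is an illustration, not something PyMuPDF provides.)
OPTIONAL_PACKAGES = {"Pillow": "PIL", "fontTools": "fontTools"}


def optional_dependency_status():
    """Return {package name: bool} without actually importing the packages."""
    return {
        name: find_spec(module) is not None
        for name, module in OPTIONAL_PACKAGES.items()
    }


for name, available in optional_dependency_status().items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If a package is reported as missing, it can be installed with pip in the same way as pymupdf.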

PyMuPDF core classes

Classes used in the core class demo:

[Figure: overview of the classes used in the demo]

Classes not used in the demo: Archive, Colorspace, DisplayList, DocumentWriter, Identity, IRect (integer rectangle), linkDest (link destination), Matrix, Outline, Quad (quadrilateral), Shape, Story, TextPage, TextWriter, Tools, and Xml (XML document).

PyMuPDF core class demo

Loading a PDF file

# Load a PDF file (raw string, so the backslashes are not treated as escapes)
doc = fitz.open(r"E:\doc\opencv 4.1中文官方文檔v1.1版.pdf")

Document properties and methods

# Properties and methods of the Document object
# 1. Number of pages
page_count = doc.page_count
print("PDF page count:", page_count)

# 2. Metadata
metadata = doc.metadata
print("PDF metadata:", metadata)

# 3. Table of contents
toc = doc.get_toc()
print("PDF table of contents:", toc)

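Page properties and methods

A Page object lets you do the following:

- Render the page as a raster or vector (SVG) image, optionally scaling, rotating, shifting, or clipping it.

- Extract the page's text and images in many formats, and search for text strings.

Loading a page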

page = doc.load_page(pno) # loads page number 'pno' of the document (0-based)
page = doc[pno] # the short form

Loading pages with the Document iterator

for page in doc:
    pass  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'

# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'

# Properties and methods of the Page object
page = doc.load_page(0)  # load the first page (page numbers are 0-based)
print("page object:", page)

Inspecting a page's links, annotations, and form fields

# 1. Links, annotations, and form fields of a page
links = page.get_links()
for link in links:
    # each entry is a dictionary describing one link
    print("link:", link)

annots = page.annots()
for annot in annots:
    # Annot objects
    print("annotation:", annot)

widgets = page.widgets()
for widget in widgets:
    # Widget objects (form fields)
    print("form field:", widget)

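Rendering a page and saving the image to a file

# 2. Render the page as a raster image
pix = page.get_pixmap()
print("page image object:", pix)
# pil_save() depends on the third-party Pillow package
pix.pil_save("page-%i.png" % page.number)

Page.get_pixmap() offers many options for controlling the image: resolution, colorspace (for example, to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, and so on.

A Pixmap has many methods and attributes. Among them are the integers width, height (each in pixels), and stride (the number of bytes in one horizontal image line). The samples attribute is a rectangular area of bytes holding the image data (a Python bytes object).

Tip: page.get_svg_image() creates a vector image of the page.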

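Extracting text and images

# 3. Get text, images, and other information from a page
# Note: the related constants are defined for the TextPage class
text = page.get_text("text")
print("page text:", text)

Pass one of the following strings as the opt argument to obtain different formats:

- "text": (default) plain text with line breaks. No formatting, no text position details, no images.
- "blocks": a list of text blocks (paragraphs).
- "words": a list of words (strings that contain no spaces).
- "html": a full visual version of the page, including any images. It can be displayed in a web browser.
- "dict" / "json": the same level of information as HTML, but as a Python dictionary or a JSON string, respectively.
- "rawdict" / "rawjson": a superset of "dict" / "json". It additionally provides per-character details, like XML.
- "xhtml": the same text information level as the "text" version, but including images.
- "xml": no images, but full position and font information down to each single character. Use an XML module to interpret it.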

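Searching for text

# 4. Search for text on a page
search = page.search_for("圖像的基本操作")
print("locations of the matches:", search)

This returns a list of rectangles, each surrounding one occurrence of the search string (the search is case-insensitive).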

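PDF operations

PDF is the only document type that PyMuPDF can modify; other file types are read-only. However, you can convert any document (including images) to PDF with Document.convert_to_pdf() and then apply all PyMuPDF features to the conversion result.

Document.save() always stores the PDF on disk in its current (possibly modified) state.

You can usually choose between saving to a new file and appending only your changes to the existing file ("incremental save"), which is often much faster.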

# Manipulating PDF pages through the Document object
# 1. Delete a page
# doc.delete_page(1)
# 2. Copy and move pages
doc.copy_page(0)  # copy the first page to the end; the original page stays in place
# 3. Insert a new page; the new Page object is returned
new_page = doc.new_page(pno=-1, width=595, height=842)
# Write text on the new page
text = "Your text"
point = fitz.Point(50, 50)  # (x, y) position where the text starts
new_page.insert_text(point, text, fontsize=20)
# 4. Save the document
doc.save("opencv_modified.pdf")
# 5. Close the document
doc.close()

Methods for deleting pages

Document.delete_page()
Document.delete_pages()

Methods for copying and moving pages

Document.copy_page()
Document.fullcopy_page()
Document.move_page()

Methods for inserting pages

Document.insert_page()
Document.new_page()

Wrapping PyMuPDF core features in helper functions

Splitting a PDF

Save each page as a separate PDF

import os

import fitz


def split_per_page(input_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    doc = fitz.open(input_path)
    for page in range(doc.page_count):
        dst_doc = fitz.open()
        dst_doc.insert_pdf(doc, from_page=page, to_page=page)
        dst_doc.save(os.path.join(output_dir, f'{page}.pdf'))
        dst_doc.close()
    doc.close()

# Save every page of test.pdf as its own PDF in the "test" folder
split_per_page("test.pdf", "test")

Save a range of pages as a PDF

def split_range_page(input_path, output_dir, page_range):
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(input_path)
    # page_range is 1-based and inclusive; PyMuPDF page numbers are 0-based
    start = page_range[0] - 1
    end = page_range[1] - 1
    dst_doc = fitz.open()
    dst_doc.insert_pdf(doc, from_page=start, to_page=end)
    dst_doc.save(os.path.join(output_dir, 'range_page.pdf'))
    dst_doc.close()
    doc.close()

# Save pages 1-10 as a PDF in the "test" folder
split_range_page('test.pdf', 'test', [1, 10])

Save arbitrary pages as a PDF

def split_selected_page(input_path, output_dir, pages):
    os.makedirs(output_dir, exist_ok=True)

    doc = fitz.open(input_path)
    # convert the 1-based page numbers to 0-based indexes
    doc.select([p - 1 for p in pages])
    doc.save(os.path.join(output_dir, 'selected_pages.pdf'))
    doc.close()

# Save pages 1, 3, and 8 as a PDF in the "test" folder
split_selected_page('test.pdf', 'test', [1, 3, 8])
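All three split helpers accept 1-based page numbers and convert them to PyMuPDF's 0-based indexes, so a page number outside the document would only fail inside the library call. A small validation step can catch that earlier. This is a plain-Python sketch; the function name to_zero_based is hypothetical.

```python
def to_zero_based(pages, page_count):
    """Convert 1-based page numbers to 0-based indexes, rejecting invalid ones."""
    indexes = []
    for number in pages:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indexes.append(number - 1)
    return indexes


print(to_zero_based([1, 3, 8], page_count=10))  # -> [0, 2, 7]
```

The result could be passed directly to Document.select(), with Document.page_count as the second argument.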

Merging PDFs

import fitz

doc_a = fitz.open("a.pdf") # open the 1st document
doc_b = fitz.open("b.pdf") # open the 2nd document

doc_a.insert_pdf(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename

# Merges b.pdf into a.pdf and saves the result as a+b.pdf

Extracting images from a PDF

import fitz

doc = fitz.open("test.pdf") # open a document

for page_index in range(len(doc)): # iterate over pdf pages
	page = doc[page_index] # get the page
	image_list = page.get_images()

	# print the number of images found on the page
	if image_list:
		print(f"Found {len(image_list)} images on page {page_index}")
	else:
		print("No images found on page", page_index)

	for image_index, img in enumerate(image_list, start=1): # enumerate the image list
		xref = img[0] # get the XREF of the image
		pix = fitz.Pixmap(doc, xref) # create a Pixmap

		if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
			pix = fitz.Pixmap(fitz.csRGB, pix)

		pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
		pix = None

Saving PDF pages as images

import os

import fitz


def convert2pic(zoom):
    doc = fitz.open("test.pdf")
    os.makedirs(".pdf", exist_ok=True)  # the output folder must exist before saving
    scale = int(zoom) / 100.0  # 100 = original size; larger values give a sharper image
    rotate = 0
    trans = fitz.Matrix(scale, scale).prerotate(rotate)
    for pg in range(doc.page_count):
        page = doc[pg]
        pm = page.get_pixmap(matrix=trans, alpha=False)
        pm.save('.pdf/%s.jpg' % (pg + 1))
    doc.close()

convert2pic(200)

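Adding a watermark to a PDF

import os

import fitz


def add_watermark(input_path, watermark):
    doc = fitz.open(input_path)
    for page in doc:
        # overlay=False places the image behind the existing page content
        page.insert_image(page.bound(), filename=watermark, overlay=False)
    os.makedirs("test", exist_ok=True)
    doc.save(os.path.join("test", "watermark.pdf"))
    doc.close()

add_watermark("test.pdf", "watermark.png")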

Encrypting a PDF

PDF supports two kinds of password protection:

User password: must be entered to open the PDF.
Owner password: restricts operations such as printing, copying, annotating, and adding or deleting pages.

import fitz

perm = int(
    fitz.PDF_PERM_ACCESSIBILITY  # always use this
    | fitz.PDF_PERM_PRINT  # permit printing
    | fitz.PDF_PERM_COPY  # permit copying
    | fitz.PDF_PERM_ANNOTATE  # permit annotations
)  # printing, copying, and annotating are allowed
owner_pass = "owner"  # owner password
user_pass = "user"  # user password
encrypt_meth = fitz.PDF_ENCRYPT_AES_256  # strongest algorithm


def encrypt_pdf():
    doc = fitz.open("test.pdf")
    # Use both passwords together
    doc.save("encrypt.pdf", encryption=encrypt_meth, owner_pw=owner_pass,
             permissions=perm, user_pw=user_pass)
    doc.close()

# The two passwords can be used separately or together.

# Use only the user password
def encrypt_pdf_user_only():
    doc = fitz.open("test.pdf")
    doc.save("encrypt.pdf", encryption=encrypt_meth, user_pw=user_pass)
    doc.close()

Using PyMuPDF with PyQt5

Requirement: display a PDF file in PyQt5.

Result:

[Figure: screenshot of the result]

PyQt5 UI layout and source files

[Figure: screenshot of the UI layout]

pdfshow.ui

<?xml version="1.0" encoding="UTF-8"?>
<ui version="4.0">
 <class>Form</class>
 <widget class="QWidget" name="Form">
  <property name="geometry">
   <rect>
    <x>0</x>
    <y>0</y>
    <width>400</width>
    <height>300</height>
   </rect>
  </property>
  <property name="windowTitle">
   <string>Form</string>
  </property>
  <widget class="QLabel" name="label">
   <property name="geometry">
    <rect>
     <x>130</x>
     <y>70</y>
     <width>54</width>
     <height>12</height>
    </rect>
   </property>
   <property name="text">
    <string>PDF展示</string>
   </property>
  </widget>
 </widget>
 <resources/>
 <connections/>
</ui>

pdfshow.py source code

# -*- coding: utf-8 -*-

# Form implementation generated from reading ui file 'pdfshow.ui'
#
# Created by: PyQt5 UI code generator 5.15.9
#
# WARNING: Any manual changes made to this file will be lost when pyuic5 is
# run again.  Do not edit this file unless you know what you are doing.
import sys

from PyQt5 import QtCore, QtWidgets
from PyQt5.QtGui import QImage, QPixmap, QTransform
from PyQt5.QtWidgets import QWidget, QApplication
# PyMuPDF, used to read the PDF file
import fitz


class Ui_Form(QWidget):
    def __init__(self):
        super().__init__()
        self.label = None
        self.setupUi()
        self.image()

    def setupUi(self):
        self.setObjectName("Form")
        self.resize(400, 300)
        self.label = QtWidgets.QLabel(self)
        self.label.setGeometry(QtCore.QRect(130, 70, 54, 12))
        self.label.setObjectName("label")

        self.retranslateUi()
        QtCore.QMetaObject.connectSlotsByName(self)

    def retranslateUi(self):
        _translate = QtCore.QCoreApplication.translate
        self.setWindowTitle(_translate("Form", "Form"))
        self.label.setText(_translate("Form", "PDF展示"))

    def image(self):
        file = r"E:\doc\opencv 4.1中文官方文檔v1.1版.pdf"
        # Open the file
        doc = fitz.open(file)
        # Load one page; index 0 is the first page
        page_one = doc.load_page(0)
        # Render the first page to a Pixmap
        page_pixmap = page_one.get_pixmap()
        # Convert the Pixmap to a QImage
        image_format = QImage.Format_RGBA8888 if page_pixmap.alpha else QImage.Format_RGB888
        page_image = QImage(page_pixmap.samples, page_pixmap.width,
                            page_pixmap.height, page_pixmap.stride, image_format)
        # Convert the QImage to a QPixmap
        pix = QPixmap.fromImage(page_image)
        trans = QTransform()
        trans.rotate(90)  # rotation angle in degrees
        new = pix.transformed(trans)
        # Set the label's width and height
        self.label.setFixedSize(400, 350)
        # Scale the image to fit the label
        self.label.setScaledContents(True)
        # Show the image in the label
        self.label.setPixmap(new)


if __name__ == '__main__':
    app = QApplication(sys.argv)

    w = Ui_Form()
    w.show()
    sys.exit(app.exec_())

Approach

Open the file with the PyMuPDF module.
Load the first page of the PDF file.
Render the first page to an image, which is a Pixmap object.
Convert that Pixmap to a QImage with PyQt5.
Convert the QImage to a QPixmap.
Set the QPixmap on the label.

Previewing a PDF file with PyMuPDF

UI prototype:

[Figure: UI prototype]

Python source code

ImageListWidget: a custom QListWidget that shows its items in icon (image-only) mode

# _*_ coding : UTF-8_*_
# Author: zhuozhiwengang
# Created: 2023/8/6 0:54
# File name: ImageListWidget
# IDE: PyCharm
from PyQt5.QtCore import QSize
from PyQt5.QtGui import QIcon
from PyQt5.QtWidgets import QListWidget, QListWidgetItem, QListView, QVBoxLayout


class ImageListWidget(QListWidget):
    def __init__(self):
        super(ImageListWidget, self).__init__()
        self.setFlow(QListView.TopToBottom)  # items flow from top to bottom
        self.setIconSize(QSize(150, 100))
        # Use icon mode for the list view
        self.setViewMode(QListWidget.IconMode)
        # Use a vertical layout
        self.setLayout(QVBoxLayout())

    def add_image_items(self, pixmaps=()):
        # pixmaps holds QPixmap objects, one per page
        for i, pixmap in enumerate(pixmaps):
            # Create the thumbnail icon
            icon = QIcon()
            icon.addPixmap(pixmap, QIcon.Normal, QIcon.Off)
            # Create the list item with the icon and its caption (the page index)
            item = QListWidgetItem(icon, str(i))
            # Add the item to the list widget
            self.addItem(item)

ImageViewerWidget: a custom PDF preview widget

# _*_ coding : UTF-8_*_
# Author: zhuozhiwengang
# Created: 2023/8/6 0:55
# File name: ImageViewerWidget
# IDE: PyCharm
import fitz
from PyQt5.QtGui import QPixmap, QImage
from PyQt5.QtWidgets import QWidget, QLabel, QApplication, QVBoxLayout

from ImageListWidget import ImageListWidget


class ImageViewerWidget(QWidget):
    def __init__(self):
        super().__init__()
        # Widgets
        self.list_widget = ImageListWidget()
        self.list_widget.setMinimumWidth(200)
        self.show_label = QLabel(self)
        self.show_label.setFixedSize(600, 400)
        self.image_paths = []
        self.currentImgIdx = 0
        self.currentImg = None

        # Vertical layout: preview on top, thumbnail list below
        self.main_layout = QVBoxLayout(self)
        self.main_layout.addWidget(self.show_label)
        self.main_layout.addWidget(self.list_widget)

        # Signal/slot connection
        self.list_widget.itemSelectionChanged.connect(self.loadImage)

    def load_from_paths(self, img_paths=()):
        # Despite the name, img_paths holds QPixmap objects, one per page
        self.image_paths = list(img_paths)
        self.list_widget.add_image_items(self.image_paths)

    def loadImage(self):
        self.currentImgIdx = self.list_widget.currentIndex().row()
        if self.currentImgIdx in range(len(self.image_paths)):
            self.currentImg = QPixmap(self.image_paths[self.currentImgIdx]).scaledToHeight(400)
            self.show_label.setPixmap(self.currentImg)


if __name__ == "__main__":
    import sys
    app = QApplication(sys.argv)

    # Path of the PDF file (raw string, so the backslashes are not treated as escapes)
    file = r"E:\doc\opencv 4.1中文官方文檔v1.1版.pdf"
    # Open the file
    doc = fitz.open(file)

    img_paths = []
    for i in range(0, doc.page_count):
        # Load one page; index 0 is the first page
        page = doc.load_page(i)
        # Render the page to a Pixmap
        page_pixmap = page.get_pixmap()
        # Convert the Pixmap to a QImage
        image_format = QImage.Format_RGBA8888 if page_pixmap.alpha else QImage.Format_RGB888
        page_image = QImage(page_pixmap.samples, page_pixmap.width,
                            page_pixmap.height, page_pixmap.stride, image_format)
        # Convert the QImage to a QPixmap
        pix = QPixmap.fromImage(page_image)
        img_paths.append(pix)

    # Create and show the viewer widget
    main_widget = ImageViewerWidget()
    main_widget.load_from_paths(img_paths)
    main_widget.setWindowTitle("ImageViewer")
    main_widget.show()

    # Run the application
    sys.exit(app.exec_())

Result

[Figure: screenshot of the result]