Python - Process PDF

  • Overview

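    Python can read a PDF file and print its content after extracting the text from it. To do this, we first need to install the required module, PyPDF2. The command to install the module is given below. You should already have pip installed in your Python environment.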
    
    pip install pypdf2
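
    Before moving on, it can help to confirm that the module is importable from the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name is_module_installed is hypothetical and not part of PyPDF2.

```python
import importlib.util


def is_module_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name; pip treats the package name case-insensitively.
    if is_module_installed("PyPDF2"):
        print("PyPDF2 is available")
    else:
        print("PyPDF2 is missing; run: pip install pypdf2")
```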
    
    Once the module is installed successfully, we can read the PDF file using the methods it provides.
    
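    import PyPDF2

    # Raw string so the backslash in the Windows-style path is taken literally.
    pdfName = r'path\Tutorialspoint.pdf'
    # PdfReader is the current class name (PdfFileReader in PyPDF2 1.x).
    read_pdf = PyPDF2.PdfReader(pdfName)
    # Pages are zero-indexed; pages[0] is the first page (getPage(0) in 1.x).
    page = read_pdf.pages[0]
    # extract_text() returns the page text (extractText() in 1.x).
    page_content = page.extract_text()
    print(page_content)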
    
    When we run the above program, we get the following output:
    
    CAINIAOYA originated from the idea that there exists a class of readers who respond better
    to online content and prefer to learn new skills at their own pace from the comforts of their 
    drawing rooms.
     
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
    it generated, we worked our way to adding fresh tutorials to our repository which now 
    proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
    languages to web designing to academics and much more.
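
    Not every page yields text: a scanned, image-only page typically extracts as an empty string, and asking for a page index that does not exist raises an error. The sketch below wraps the single-page read with both checks. The function name page_text and the placeholder string are hypothetical; the reader argument is assumed to behave like PyPDF2.PdfReader, that is, to expose a pages sequence whose items offer extract_text().

```python
def page_text(reader, index=0, placeholder="[no extractable text]"):
    """Return the text of reader.pages[index], or a placeholder if it is empty.

    `reader` is assumed to behave like PyPDF2.PdfReader: it exposes a
    `pages` sequence whose items provide extract_text().
    """
    pages = reader.pages
    if not 0 <= index < len(pages):
        raise IndexError(
            f"page index {index} is out of range for {len(pages)} page(s)"
        )
    text = pages[index].extract_text()
    # Treat None, empty, and whitespace-only results as "no text found".
    return text if text and text.strip() else placeholder
```

    With PyPDF2 installed, a call such as page_text(PyPDF2.PdfReader(pdfName)) would return the first page's text or the placeholder.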
    
  • Reading Multiple Pages

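    To read a PDF with multiple pages and print each page along with its page number, we use a loop together with the reader's page-number lookup (getPageNumber() in older PyPDF2 releases, get_page_number() in current ones). In the example below, the PDF file has two pages. The content is printed under two separate page headings.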
    
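    import PyPDF2

    # Raw string so the backslash in the Windows-style path is taken literally.
    pdfName = r'Path\Tutorialspoint2.pdf'
    read_pdf = PyPDF2.PdfReader(pdfName)
    # len(read_pdf.pages) is the page count (getNumPages() in PyPDF2 1.x).
    for i in range(len(read_pdf.pages)):
        page = read_pdf.pages[i]
        # get_page_number() is zero-based (getPageNumber() in 1.x), so add 1.
        print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
        page_content = page.extract_text()
        print(page_content)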
    
    When we run the above program, we get the following output:
    
    Page No - 1
    CAINIAOYA originated from the idea that there exists a class of readers who respond better to
    online content and prefer to learn new skills at their own pace from the comforts of their drawing 
    rooms. 
    Page No - 2
     
    The journey commenced with a single tutorial on HTML in 2006 and elated by the response it 
    generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
    a wealth of tutorials and allied articles on topics ranging from p
    rogramming languages to web 
    designing to academics and much more.
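
    The page-by-page loop can also be written by iterating over the pages directly and letting enumerate() supply the page number. This is a sketch of an alternative, not the method used above. The helper names numbered_page_texts and print_pdf are hypothetical, and the reader argument is assumed to expose a pages sequence like PyPDF2.PdfReader does.

```python
def numbered_page_texts(reader):
    """Return [(page_number, text), ...] with page numbers starting at 1."""
    return [
        (number, page.extract_text())
        for number, page in enumerate(reader.pages, start=1)
    ]


def print_pdf(path):
    """Print every page of the PDF at `path` under a 'Page No - N' heading."""
    # Imported here so numbered_page_texts stays usable without PyPDF2.
    import PyPDF2

    for number, text in numbered_page_texts(PyPDF2.PdfReader(path)):
        print(f"Page No - {number}")
        print(text)
```

    Collecting the pairs in a list before printing keeps the extraction step separate from the output step, so the same data could be written to a file or searched instead.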