How to extract text from a PDF file via python?

I'm trying to extract the text included in this PDF file using Python. I'm using the PyPDF2 package (version 1.27.2) and have the following script:

    import PyPDF2

    with open("sample.pdf", "rb") as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.pages[0]
        page_content = page.extractText()
    print(page_content)

When I run the code, I get the following output which is different from that included in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4 5 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &) %

How can I extract the text as is in the PDF document?

asked Jan 17, 2016 at 11:16 Simplicity

Copy the text using a good PDF viewer - Adobe's canonical Acrobat Reader, if possible. Do you get the same result? The difference is not that the text is different, but the font is - the character codes map to other values. Not all PDFs contain the correct data to restore this.

Commented Jan 17, 2016 at 11:51

I tried another document and it worked. Yes, it seems the issue is with the PDF itself.

Commented Jan 17, 2016 at 13:11

That PDF contains a character CMap table, so the restrictions and work-arounds discussed in this thread are relevant - stackoverflow.com/questions/4203414/….

Commented Jan 17, 2016 at 21:34

The PDF indeed contains a correct CMAP so it is trivial to convert the ad hoc character mapping to plain text. However, it takes additional processing to retrieve the correct order of text. Mac OS X's Quartz PDF renderer is a nasty piece of work! In its original rendering order I get "m T’h iuss iisn ga tosam fopllloew DalFo dnogc wumithe ntht eI tutorial". Only after sorting by x coordinates I get a far more likely correct result: "This is a sample PDF document I’m using to follow along with the tutorial".

Commented Jan 25, 2016 at 20:15

PyPDF2 adds random whitespaces between/in words. Very hard to process.

Commented Jun 17, 2022 at 11:12

35 Answers

I was looking for a simple solution to use with Python 3.x on Windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    from tika import parser  # pip install tika

    raw = parser.from_file('sample.pdf')
    print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed.

answered Feb 7, 2018 at 21:43

I tested pypdf2, tika and tried and failed to install textract and pdftotext. Pypdf2 returned 99 words while tika returned all 858 words from my test invoice. So I ended up going with tika.

Commented Jun 19, 2018 at 9:11

I keep getting a "RuntimeError: Unable to start Tika server" error.

Commented Oct 16, 2018 at 12:39

If you need to run this on all the PDF files in a directory (recursively), take this script

Commented Apr 19, 2019 at 10:28

For anyone getting the "Unable to start Tika server" error: I solved it by installing the latest version of Java as suggested here, which I did on Mac OS X with brew following this answer.

Commented Oct 8, 2019 at 14:51

It downloads a 76 MB tika-server.jar file into C:\Users\User\AppData\Local\Temp. Is there a way to make this permanent if I clean Temp later? It also requires a Java VM to be installed, is that right?

Commented Nov 15, 2019 at 12:30

pypdf recently improved a lot. Depending on the data, it is on par with or better than pdfminer.six.

pymupdf / tika / PDFium are better than pypdf, but the difference has become rather small (mostly in deciding when to start a new line). The main point is that they are way faster. But they are not pure Python, which can mean that you cannot run them in your environment. And some have licenses that may be too restrictive for your use.

Have a look at the benchmark. This benchmark mainly considers English texts, but also German ones. It does not include:

Anything special regarding tables (just that the text is there, not about the formatting)
Arabic text (RTL languages)
Mathematical formulas.

That means if your use-case requires those points, you might perceive the quality differently.

Having said that, the results from November 2022:

Quality

Speed

pypdf

I became the maintainer of pypdf and PyPDF2 in 2022! 😁 The community improved the text extraction a lot in 2022. Give it a try :-)

First, install it:

pip install pypdf

And then use it:

    from pypdf import PdfReader

    reader = PdfReader("example.pdf")
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"

Please note that those packages are not maintained:

PyPDF2, PyPDF3, PyPDF4
pdfminer (without .six)

pymupdf

    import fitz  # install using: pip install PyMuPDF

    with fitz.open("my.pdf") as doc:
        text = ""
        for page in doc:
            text += page.get_text()

    print(text)

Other PDF libraries

pikepdf does not support text extraction (source)

answered Aug 21, 2020 at 7:02 Martin Thoma

However, there seems to be a problem with the order of the text from the PDF. Intuitively the text would read from top to bottom and left to right, but here it seems to show up in another order.

Commented Mar 20, 2021 at 19:45

Except, it occasionally just can't find the text in a page.

Commented Sep 21, 2021 at 8:26

@Raf If you have an example PDF, please go ahead and create an issue: github.com/pymupdf/PyMuPDF/issues - the developer behind it is pretty active

Commented Sep 21, 2021 at 10:01

This is the most light-weight answer I've seen so far. No java server necessary!

Commented Nov 2, 2021 at 14:59

This is the latest working solution as of 23 Jan 2022.

Commented Jan 23, 2022 at 13:20

http://textract.readthedocs.io/en/latest/
https://github.com/deanmalmgren/textract

It supports many types of files including PDFs

    import textract

    text = textract.process("path/to/file.extension")

answered Nov 12, 2016 at 10:55 Jakobovski

Works for PDFs, epubs, etc - processes PDFs that even PDFMiner fails on.

Commented Feb 7, 2017 at 1:57

How can I use it in AWS Lambda? I tried this, but an import error occurred for textract.

Commented Feb 27, 2018 at 7:17

textract is a wrapper for Poppler:pdftotext (among others).

Commented Apr 17, 2018 at 0:21

@ArunKumar: To use anything in AWS Lambda that's not built-in, you have to include it and all extra dependencies, in your bundle.

Commented Jun 6, 2018 at 15:58

textract seems to be dead (source). Use either pdfminer.six directly or pymupdf

Commented Aug 21, 2020 at 7:13

Look at this code for PyPDF2

    import PyPDF2

    pdf_file = open('sample.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    # Python 2 print statement; on Python 3 use print(page_content)
    print page_content.encode('utf-8')

!"#$%#$%&%$&'()*%+,-%./01'*23%4 5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&) %

Using the same code to read a PDF from 201308FCR.pdf, the output is normal.

    def extractText(self):
        """
        Locate all text drawing commands, in the order they are provided in the
        content stream, and extract the text.  This works well for some PDF
        files, but poorly for others, depending on the generator used.  This will
        be refined in the future.  Do not rely on the order of text coming out of
        this function, as it will change if this function is made more
        sophisticated.
        :return: a unicode string object.
        """

answered Jan 20, 2016 at 4:00

@VineeshTP: Are you getting anything for page_content? If yes, then see if it helps by using a different encoding other than (utf-8)

Commented Jul 14, 2019 at 22:17

Best library I found for reading the pdf using python is 'tika'

Commented Jul 15, 2019 at 6:38

201308FCR.pdf not found.

Commented Apr 5, 2020 at 0:33

@Matin Thoma is it possible to preserve the format, when extracting, say python code from a PDF?

Commented Jan 24, 2023 at 14:04

After trying textract (which seemed to have too many dependencies) and pypdf2 (which could not extract text from the pdfs I tested with) and tika (which was too slow) I ended up using pdftotext from xpdf (as already suggested in another answer) and just called the binary from python directly (you may need to adapt the path to pdftotext):

    import os, subprocess

    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    args = ["/usr/local/bin/pdftotext",
            '-enc',
            'UTF-8',
            "{}/my-pdf.pdf".format(SCRIPT_DIR),
            '-']
    res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = res.stdout.decode('utf-8')

There is also pdftotext, which does basically the same thing, but it assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to use it from the current directory.
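If the binary's location differs between machines (a system install versus a copy bundled next to the script), the subprocess call above can be wrapped in a small helper. This is only a sketch: `build_pdftotext_args` and `pdf_to_text` are hypothetical names, and it assumes a `pdftotext` executable from xpdf or Poppler is available. The `-enc`, `-layout` and trailing `-` (write to stdout) options are the ones already used in this thread.

```python
import shutil
import subprocess


def build_pdftotext_args(pdf_path, binary="pdftotext", layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    args = [binary, "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page
        args.append("-layout")
    # A trailing "-" as the output file means "write the text to stdout"
    args += [pdf_path, "-"]
    return args


def pdf_to_text(pdf_path, binary=None, layout=False):
    """Run pdftotext and return the extracted text as a str."""
    # Fall back to whatever pdftotext is on PATH (assumption: one is installed)
    binary = binary or shutil.which("pdftotext")
    if binary is None:
        raise FileNotFoundError("pdftotext executable not found")
    res = subprocess.run(
        build_pdftotext_args(pdf_path, binary, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return res.stdout.decode("utf-8")
```

On Lambda, `binary` would be the path of the bundled executable, for example one built from `SCRIPT_DIR`.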

Btw: For using this on lambda you need to put the binary and the dependency to libstdc++.so into your lambda function. I personally needed to compile xpdf. As instructions for this would blow up this answer I put them on my personal blog.

answered Mar 13, 2018 at 20:30 hansaplast

Oh my god, it works!! Finally, a solution that extracts the text in the correct order! I want to hug you for this answer! (Or if you don't like hugs, here's a virtual coffee/beer/. )

Commented Nov 27, 2018 at 10:20

glad it helped! Upvoting gives the same sensation as hugging, so I'm fine!

Commented Nov 28, 2018 at 6:47

simple . gr8 out of box thinking!

Commented Aug 13, 2019 at 5:03

Please give PyPDF2 another chance. We've improved it a lot :-)

Commented Dec 20, 2022 at 22:52

I've tried many Python PDF converters, and I'd like to update this review. Tika is one of the best, but PyMuPDF, suggested by user @ehsaneha, is good news.

I wrote code to compare them at: https://github.com/erfelipe/PDFtextExtraction I hope it helps you.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    from tika import parser

    raw = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
    raw = str(raw)

    safe_text = raw.encode('utf-8', errors='ignore')

    safe_text = str(safe_text).replace("\n", "").replace("\\", "")
    print('--- safe text ---')
    print(safe_text)

answered Mar 1, 2019 at 1:12

special thanks for .encode('utf-8', errors='ignore')

Commented Mar 24, 2019 at 7:50

AttributeError: module 'os' has no attribute 'setsid'

Commented Feb 22, 2020 at 6:50

this worked for me, when opening the file in 'rb' mode with open('../path/to/pdf','rb') as pdf: raw = str(parser.from_file(pdf)) text = raw.encode('utf-8', errors='ignore')

Commented Mar 31, 2021 at 16:42

You may want to use the time-proven xPDF and derived tools to extract text instead, as PyPDF2 still seems to have various issues with text extraction.

The long answer is that there are a lot of variations in how text is encoded inside a PDF: extraction may require decoding the PDF string itself, then mapping it through a CMap, then analyzing the distance between words and letters, etc.
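To make the last step concrete, here is a minimal sketch of rebuilding lines and word breaks from positioned fragments. The `(x, y, width, text)` tuples, the function name and both thresholds are assumptions for illustration; real extractors expose positions in their own structures and use more careful rules.

```python
def fragments_to_text(fragments, line_tolerance=2.0, space_gap=1.0):
    """Join positioned text fragments into lines of text.

    Each fragment is a (x, y, width, text) tuple in page units
    (a hypothetical structure chosen for this sketch).
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    # PDF y coordinates grow upwards, so go top to bottom by descending y
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - frag[1]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    out = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left to right within a line
        parts = []
        prev_end = None
        for x, _, width, text in frags:
            # Insert a space when the horizontal gap looks like a word break
            if prev_end is not None and x - prev_end > space_gap:
                parts.append(" ")
            parts.append(text)
            prev_end = x + width
        out.append("".join(parts))
    return "\n".join(out)
```

Fragments that arrive in drawing order rather than reading order come out sorted, which is the same idea as the "sorting by x coordinates" comment under the question.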

In case the PDF is damaged (i.e. it displays the correct text, but copying it gives garbage) and you really need to extract the text, you may want to consider converting the PDF into images (using ImageMagick) and then using Tesseract to get the text from the images via OCR.
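Whether to fall back to OCR can be decided automatically with a rough check on the extracted text. The heuristic below is an assumption of this sketch, not something any of the libraries provide: it treats text as garbage when fewer than half of its visible characters are letters, which would also misfire on legitimately numeric pages such as invoices.

```python
def looks_garbled(text, min_letter_ratio=0.5):
    """Heuristic: True when too few of the visible characters are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_letter_ratio
```

The output quoted in the question (`! " # $ % ...`) contains no letters at all, so it would be flagged, while the intended sentence would not.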

answered Jan 18, 2016 at 8:42

-1 because the OP is asking for reading pdfs in Python, and although there is an xpdf wrapper for python it is poorly maintained.

Commented Dec 1, 2019 at 8:20

You might want to give PyPDF2 another shot (also mind the capitalization)

Commented Dec 20, 2022 at 22:53

PDFLayoutTextStripper is good because it can keep the layout of the original PDF.

It's written in Java but I have added a Gateway to support Python.

Sample code:

    from py4j.java_gateway import JavaGateway

    gw = JavaGateway()
    result = gw.entry_point.strip('samples/bus.pdf')

    # result is a dict of {
    #   'success': 'true' or 'false',
    #   'payload': pdf file content if 'success' is 'true'
    #   'error': error message if 'success' is 'false'
    # }

    print(result['payload'])


Sample output from PDFLayoutTextStripper:

You can see more details here Stripper with Python

answered May 7, 2019 at 1:54

The best feature of this library is definitely its ability to (mostly) preserve the layout. The worst is that you need to standup a gateway service in Java.

Commented Feb 9 at 13:27

The below code is a solution to the question in Python 3. Before running the code, make sure you have installed the pypdf library in your environment. If not installed, open the command prompt and run the following command (instead of pip you might need pip3):

pip install pypdf --upgrade

Solution Code using pypdf > 3.0.0:

    import pypdf

    reader = pypdf.PdfReader('sample.pdf')
    for page in reader.pages:
        print(page.extract_text())
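To keep the text for later analysis instead of printing it, the page texts from the loop above can be collected and written to one file. A minimal sketch: `save_pages` is a hypothetical helper, and the form-feed separator between pages is an assumption of this sketch. The helper itself only needs an iterable of strings, so it works with any of the libraries in this thread.

```python
def save_pages(page_texts, out_path, separator="\n\f"):
    """Write one text per page to out_path and return the page count."""
    # Keep empty pages (for example image-only ones) so that positions in
    # the file stay aligned with page numbers in the PDF.
    texts = [text or "" for text in page_texts]
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(separator.join(texts))
    return len(texts)
```

With pypdf this would be called as `save_pages((page.extract_text() for page in reader.pages), "sample.txt")`.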

answered May 23, 2018 at 13:38 Steffi Keran Rani J

How would you save all the content in one text file and use it for further analysis?

Commented Aug 24, 2018 at 7:45

PyPDF2 in some cases ignores white space and makes the resulting text a mess, but I use PyMuPDF and I'm really satisfied. You can use this link for more info.

answered Aug 4, 2018 at 16:38

pymupdf is the best solution I observed, does not require additional C++ libraries like pdftotext or java like tika

Commented Oct 4, 2019 at 13:56

pymupdf is really the best solution: no additional server or libraries, and it works with files where PyPDF2, PyPDF3 and PyPDF4 retrieve an empty string of text. Many thanks!

Commented Feb 26, 2020 at 13:45

To install pymupdf, run pip install pymupdf==1.16.16. I am using this specific version because today the newest version (17) is not working. I opted for pymupdf because it extracts text wrapping fields in the newline char \n. So I'm extracting the text from the PDF to a string with pymupdf and then using my_extracted_text.splitlines() to get the text split into lines, as a list.

Commented Apr 9, 2020 at 13:53

PyMuPDF was really surprising. Thanks.

Commented May 4, 2020 at 20:08

Page doesn't exist

Commented Sep 22, 2020 at 17:28

pdftotext is the best and simplest one! pdftotext also preserves the structure.

I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.

answered Apr 3, 2019 at 12:16

Message as follows when installing pdf2text, Collecting PDFMiner (from pdf2text) , so I don't understand this answer now.

Commented Sep 24, 2019 at 6:01

pdf2text and pdftotext are different. You can use the link from the answer.

Commented Nov 5, 2019 at 6:04

OK. That's a little bit confusing.

Commented Nov 6, 2019 at 3:33

You might want to give PyPDF2 another shot. We've improved it a lot.

Commented Dec 20, 2022 at 22:53

In 2020 the solutions above were not working for the particular pdf I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8

    # pip install pdfminer.six
    import io

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        '''Convert pdf content from a file path to text

        :path the file path
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()

        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0
                    caching = True
                    pagenos = set()

                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)

                    return retstr.getvalue()


    if __name__ == "__main__":
        print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

answered Jul 31, 2020 at 11:18

Excellent answer. There's an anaconda install as well. I had it installed and had extracted text in < 5 minutes. (Note: tika also worked, but pdfminer.six was much faster.)

Commented Sep 21, 2020 at 1:33

You are a lifesaver!

Commented Oct 21, 2020 at 12:30

In 2023, 3 lines of pypdf do the same: extract text with pypdf

Commented Mar 22, 2023 at 9:25

In 2024, many libraries can extract the text, but depending upon the original structure of the PDF -- particularly the use of tables -- the result will vary dramatically. 3 lines of code does not imply that the output from a given PDF will be coherent or useful.

Commented Feb 9 at 13:30

I tested Jortega's code above, and it really struggled with data in tables, especially when there was a blank cell.

Commented Feb 9 at 14:00

pdfplumber is one of the better libraries to read and extract data from pdf. It also provides ways to read table data and after struggling with a lot of such libraries, pdfplumber worked best for me.

Mind you, it works best for machine-written pdf and not scanned pdf.

    import pdfplumber

    with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
        first_page = pdf.pages[0]
        print(first_page.extract_text())

answered Oct 19, 2021 at 14:04 Aklank Jain

This is nice, but I have a question on the format of the output. I want to save the result of the print into a pandas dataframe. Is that possible?

Commented Jan 19, 2022 at 15:02

I've got a better workaround than OCR that also maintains the page alignment while extracting the text from a PDF. It should be of help:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO


    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text


    text = convert_pdf_to_txt('test.pdf')
    print(text)

answered Mar 16, 2020 at 12:38

Nb. The latest version no longer uses the codec arg . I fixed this by removing it i.e. device = TextConverter(rsrcmgr, retstr, laparams=laparams)

Commented Jul 10, 2020 at 12:56

A multi-page PDF can be extracted as text in a single pass, instead of giving individual page numbers as arguments, using the code below:

    import PyPDF2
    import collections

    pdf_file = open('samples.pdf', 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    c = collections.Counter(range(number_of_pages))
    for i in c:
        page = read_pdf.getPage(i)
        page_content = page.extractText()
        # Python 2 print statement; on Python 3 use print(page_content)
        print page_content.encode('utf-8')

answered Jun 22, 2018 at 10:13

Only problem here: the content of the new page overwrites the last one.

Commented Aug 24, 2018 at 7:44

PDF to text keeps text format indentation, doesn't matter if you have tables.

answered Dec 6, 2017 at 23:20 Máxima Alekz

As of 2021 I would like to recommend pdfreader due to the fact that PyPDF2/3 seems to be troublesome now and tika is actually written in java and needs a jre in the background. pdfreader is pythonic, currently well maintained and has extensive documentation here.

Installation as usual: pip install pdfreader

Short example of usage:

    from pdfreader import PDFDocument, SimplePDFViewer

    # get raw document
    fd = open(file_name, "rb")
    doc = PDFDocument(fd)

    # there is an iterator for pages
    page_one = next(doc.pages())
    all_pages = [p for p in doc.pages()]

    # and even a viewer
    fd = open(file_name, "rb")
    viewer = SimplePDFViewer(fd)

answered Aug 12, 2021 at 7:23 harmonica141

On a note, installing pdfreader on Windows requires Microsoft C++ Build Tools installed on your system, whilst the answer below recommending pymupdf installed directly using pip without any extra requirement.

Commented Sep 21, 2021 at 6:14

I couldn't use it on jupyter notebook, keeps crashing the kernel

Commented Mar 6, 2022 at 20:31

If wanting to extract text from a table, I've found tabula to be easily implemented, accurate, and fast:

to get a pandas dataframe:

    import tabula

    df = tabula.read_pdf('your.pdf')

    df

By default, it ignores page content outside of the table. So far, I've only tested on a single-page, single-table file, but there are kwargs to accommodate multiple pages and/or multiple tables.

    pip install tabula-py
    # or
    conda install -c conda-forge tabula-py

answered Sep 21, 2020 at 2:12

tabula is impressive. Of all the solutions I tested from this page, this is the only one that was able to maintain the order of rows and fields. There are still a few adjustments needed for complex tables, but since the output seems reproducible from one table to the other and is stored in a pandas.DataFrame it is easy to correct.

Commented Feb 1, 2021 at 16:15

Also check Camelot.

Commented Feb 1, 2021 at 17:25

Here is the simplest code for extracting text

code:

    # importing required modules
    import PyPDF2

    # creating a pdf file object
    pdfFileObj = open('filename.pdf', 'rb')

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # printing number of pages in pdf file
    print(pdfReader.numPages)

    # creating a page object
    pageObj = pdfReader.getPage(5)

    # extracting text from page
    print(pageObj.extractText())

    # closing the pdf file object
    pdfFileObj.close()

answered Jun 14, 2018 at 7:12

Recommending 'tika'

Commented Aug 30, 2019 at 13:33

PyPDF2 / PyPDF3 / PyPDF4 are all dead. Use pymupdf

Commented Aug 21, 2020 at 7:16

You can simply do this using pytesseract and OpenCV. Refer to the following code. You can get more details from this article.

    import os
    from PIL import Image
    from pdf2image import convert_from_path
    import pytesseract

    filePath = '021-DO-YOU-WONDER-ABOUT-RAIN-SNOW-SLEET-AND-HAIL-Free-Childrens-Book-By-Monkey-Pen.pdf'
    doc = convert_from_path(filePath)

    path, fileName = os.path.split(filePath)
    fileBaseName, fileExtension = os.path.splitext(fileName)

    for page_number, page_data in enumerate(doc):
        txt = pytesseract.image_to_string(page_data).encode("utf-8")
        print("Page # {} - {}".format(str(page_number), txt))

answered Aug 5, 2021 at 17:34 SandunAmarathunga

To convert pdf to text :

    def pdf_to_text():
        from pdfminer.high_level import extract_text

        text = extract_text('test.pdf')
        print(text)

answered Jan 3, 2021 at 19:31

Order is not proper.

Commented Jun 19, 2021 at 9:15

This question so far has 35 answers, and not one seems to mention that the text extracted is the true text from the questioner's PDF page, nor explain WHY.

For comparison, here is the RAW PDF code when decompressed (inflated, under the surface, by the PDF viewer). In some cases this is what is extractable: the native "Literal" plain text.

 BT 50 0 0 50 0 0 Tm /TT2 1 Tf [ (!) -0.3 (") -0.4 (#) -0.5 ($) -0.1 (%) -0.1 (#) -0.5 ($) -0.1 (%) -0.1 (&%) -0.1 ($) -0.1 (&') 0.2 (\() -0.4 (\)) -0.5 (*) 0.4 (%) -0.1 (+) 0.4 (,) -0.2 (-) -0.5 (%) -0.1 (.) -0.4 (/) -0.3 (0) 0.1 (1) -0.4 (') 0.2 (*) 0.4 (2) -0.4 (3%) -0.1 (4) ] TJ ET

BT 50 0 0 50 0 0 Tm /TT2 1 Tf (5) Tj ET

BT 50 0 0 50 0 0 Tm /TT2 1 Tf [ (') 0.2 (%) -0.1 (1) -0.4 ($) -0.1 (#) -0.5 (2) -0.4 (6) 0.3 (%) -0.1 (3/) -0.3 (%) -0.1 (7) -0.2 (/) -0.3 (\)) -0.5 (\)) -0.5 (/) -0.3 (8) 0.2 (%) -0.1 (&\)) -0.5 (/) -0.3 (2) -0.4 (6) 0.3 (%) -0.1 (8) 0.2 (#) -0.5 (3") -0.4 (%) -0.1 (3") -0.4 (*) 0.4 (%) -0.1 (31) -0.4 (3/) -0.3 (9) 0.4 (#) -0.5 (&\)) ] TJ ET

If you study PDF, you know that the body text is the bracketed text above, so we can expect to extract this raw text coding.

! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
'% 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
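The bracketed literals can be pulled out of a content-stream snippet mechanically. A minimal sketch using only the standard library: `literals_from_stream` is a hypothetical helper, and it assumes simple literals with backslash-escaped parentheses only (no nested parentheses, octal escapes or hex strings), which covers the stream quoted above but not PDFs in general.

```python
import re

# Matches one PDF literal string "( ... )", allowing backslash escapes inside.
_LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")


def literals_from_stream(stream):
    """Return the concatenated literal strings of a content-stream snippet."""
    parts = []
    for match in _LITERAL.finditer(stream):
        body = match.group(0)[1:-1]  # drop the outer parentheses
        # Undo the escapes: "\(" becomes "(" and "\)" becomes ")"
        parts.append(re.sub(r"\\(.)", r"\1", body))
    return "".join(parts)
```

The kerning numbers between the literals (`-0.3`, `0.2`, ...) are ignored here; an extractor that wants to guess word breaks would have to look at them as well.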

Compare that with the OP observation

So, my mistake: in my extraction I missed that final (%).

So what was the real problem with "different from that included in the PDF document:"

Answer

When raw page text is placed in the page it is binary encoded as numeric data. Which to our human eyes looks like the above separate ANSI letters, but they are encoded in a PDF for simplicity as single bytes. There is a secondary PDFtoText "ToUnicode" process where the extractor has to convert the short codes into conventional CALIBRI UTF-8 screen pixels.
Here is that table

24 beginbfrange                         endbfrange


Most notable in the longer Unicode is, in this case, firstly one on one with more conventional ANSI codes to . However, there is one odd boy out so we can see that is ANSI 5 and Unicode 2019 is ’ so that single 5 on its own, has been isolated as a separate entry.

Also what about that odd % on its own at the end that I missed, why might that be ? Well look up % and it is hex which in a PDF is counted as a comment but in this case converts to \U+0009 very oddly that is (Character Tabulation) which is usually discarded when building a PDF. Thus usually has no physical width.

So using the ToUnicode values in a PDFtoText conversion we can expect it to be post extraction re-coded into

This is a sample PDF document I ’ m using to follow along with the tutorial
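The re-coding step can be sketched in a few lines. The table below is an assumption: it was inferred by aligning the raw codes with the decoded sentence quoted in this thread, not read from the PDF's actual ToUnicode CMap, and `decode_codes` is a hypothetical helper name.

```python
# Characters for the codes 0x21 ("!") to 0x39 ("9"), in code order.
# Assumption: inferred by aligning the raw codes with the decoded sentence
# quoted in this thread; it stands in for the PDF's real ToUnicode table.
DECODED = "This\tamplePDFdocuntI\u2019gfwr"
TO_UNICODE = {chr(0x21 + i): ch for i, ch in enumerate(DECODED)}


def decode_codes(raw, table=TO_UNICODE):
    """Map each single-byte code through the table; keep unknown codes."""
    text = "".join(table.get(code, code) for code in raw)
    # The "%" code maps to a tab (U+0009); show it as an ordinary space
    return text.replace("\t", " ")


raw_codes = (
    "!\"#$%#$%&%$&'()*%+,-%./01'*23%4"             # first TJ array
    "5"                                            # the lone Tj
    "'%1$#26%3/%7/))/8%&)/26%8#3\"%3\"*%313/9#&)"  # second TJ array
)
print(decode_codes(raw_codes))
```

Because the three strings are simply concatenated here, the apostrophe comes out attached to its neighbours rather than separated by spaces.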


But there seem to be other issues with that source !! (remember all those % characters have no width ?)

Solution


We need to fix the file and one very simple fix is replace the tabs with spaces by change 2 bytes from to , then resave to rebuild without error. Now extraction should be improved, but do convert with an ANSI to UTF-8 extraction such as.

pdftotext -layout -enc UTF-8 sample-fixed.pdf -


answered Feb 9 at 22:12

I am adding code to accomplish this: It is working fine for me:

    # This works in python 3
    # required python packages
    # tabula-py==1.0.0
    # PyPDF2==1.26.0
    # Pillow==4.0.0
    # pdfminer.six==20170720

    import os
    import shutil
    import warnings
    from io import StringIO

    import requests
    import tabula
    from PIL import Image
    from PyPDF2 import PdfFileWriter, PdfFileReader
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    warnings.filterwarnings("ignore")


    def download_file(url):
        local_filename = url.split('/')[-1]
        local_filename = local_filename.replace("%20", "_")
        r = requests.get(url, stream=True)
        print(r)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

        return local_filename


    class PDFExtractor():
        def __init__(self, url):
            self.url = url

        # Downloading File in local
        def break_pdf(self, filename, start_page=-1, end_page=-1):
            pdf_reader = PdfFileReader(open(filename, "rb"))
            # Reading each pdf one by one
            total_pages = pdf_reader.numPages
            if start_page == -1:
                start_page = 0
            elif start_page < 1 or start_page > total_pages:
                return "Start Page Selection Is Wrong"
            else:
                start_page = start_page - 1

            if end_page == -1:
                end_page = total_pages
            elif end_page < 1 or end_page > total_pages - 1:
                return "End Page Selection Is Wrong"
            else:
                end_page = end_page

            for i in range(start_page, end_page):
                output = PdfFileWriter()
                output.addPage(pdf_reader.getPage(i))
                with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                    output.write(outputStream)

        def extract_text_algo_1(self, file):
            pdf_reader = PdfFileReader(open(file, 'rb'))
            # creating a page object
            pageObj = pdf_reader.getPage(0)

            # extracting extract_text from page
            text = pageObj.extractText()
            text = text.replace("\n", "").replace("\t", "")
            return text

        def extract_text_algo_2(self, file):
            pdfResourceManager = PDFResourceManager()
            retstr = StringIO()
            la_params = LAParams()
            device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
            fp = open(file, 'rb')
            interpreter = PDFPageInterpreter(pdfResourceManager, device)
            password = ""
            max_pages = 0
            caching = True
            page_num = set()

            for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password,
                                          caching=caching, check_extractable=True):
                interpreter.process_page(page)

            text = retstr.getvalue()
            text = text.replace("\t", "").replace("\n", "")

            fp.close()
            device.close()
            retstr.close()
            return text

        def extract_text(self, file):
            text1 = self.extract_text_algo_1(file)
            text2 = self.extract_text_algo_2(file)

            if len(text2) > len(str(text1)):
                return text2
            else:
                return text1

        def extarct_table(self, file):
            # Read pdf into DataFrame
            try:
                df = tabula.read_pdf(file, output_format="csv")
            except:
                print("Error Reading Table")
                return

            print("\nPrinting Table Content: \n", df)
            print("\nDone Printing Table Content\n")

        def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
            tiff_header_struct = '
            # [The source text is cut off here: the rest of this method, all of
            #  extract_image(), and the start of read_pages() are missing.
            #  The code resumes in the middle of read_pages().]
                ... total_pages:
                return "Start Page Selection Is Wrong"
            else:
                start_page = start_page - 1

            if end_page == -1:
                end_page = total_pages
            elif end_page < 1 or end_page > total_pages - 1:
                return "End Page Selection Is Wrong"
            else:
                end_page = end_page

            for i in range(start_page, end_page):
                # creating a page based filename
                file = str(i + 1) + "_" + downloaded_file

                print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

                file_text = self.extract_text(file)
                print(file_text)
                self.extract_image(file)

                self.extarct_table(file)
                os.remove(file)
                print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

            os.remove(downloaded_file)


    # I have tested on these 3 pdf files
    # url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
    url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
    # url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
    # creating the instance of class
    pdf_extractor = PDFExtractor(url)

    # Getting desired data out
    pdf_extractor.read_pages(15, 23)