Extracting information from PDFs

Sarah Pelizzon
2 min read · Apr 30, 2021

Continuing my Python learning journey, this week I worked on manipulating PDF files with Python in Google Colab.

Photo created by jcomp — www.freepik.com

It went very smoothly. The PDFs can be stored in Google Drive or on GitHub; what matters is that the location lets you create a link, because that link is how Google Colab finds the files. The library used here was pdfplumber. Some of the steps for extracting the text can be seen in the picture below.
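For readers who prefer text to a screenshot, here is a minimal sketch of how a linked PDF could be fetched and read in Colab. It assumes the file sits in a public GitHub repository; the helper names (github_raw_url, download_pdf, extract_text_by_page) and the example repository details are hypothetical and not taken from the original exercise.

```python
import urllib.request


def github_raw_url(user, repo, branch, path):
    """Build the raw-content URL for a file in a public GitHub repository."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


def download_pdf(url, local_path):
    """Download the PDF at `url` into the notebook's local storage."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def extract_text_by_page(pdf_path):
    """Return one string per page, using pdfplumber's text extraction."""
    # Third-party library; in Colab, run `!pip install pdfplumber` first.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page without a text layer.
        return [page.extract_text() or "" for page in pdf.pages]


# Example usage (hypothetical repository and file names):
# url = github_raw_url("my-user", "my-repo", "main", "reports/daily.pdf")
# pages = extract_text_by_page(download_pdf(url, "daily.pdf"))
# print(pages[0])
```

For files kept in Google Drive, one common alternative is to mount the drive inside Colab and pass the mounted file path to pdfplumber instead of downloading from a URL.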

After that, it is all about specifying which information you are looking for.

In the report template used for this exercise, the daily revenue per sales representative was always in the same area of the page. Other information, such as the number of website visitors, new leads, and customers, was extracted as well, and from there one can monitor the company's marketing figures.
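Because the figures always sit in the same area, one way to read them is to crop the page to that area and parse only the text inside it. The sketch below assumes each line in the cropped area looks like "Name 1,250.00"; that line format, the bounding box values, and the function names are illustrative assumptions, since the actual report layout is not described here.

```python
import re

# Assumed line format: "<representative name> <amount>", e.g. "Alice Smith 1,250.00".
LINE_PATTERN = re.compile(
    r"^(?P<rep>[A-Za-z][A-Za-z .'-]*?)\s+\$?(?P<amount>\d[\d,]*(?:\.\d+)?)\s*$"
)


def extract_region_text(pdf_path, bbox, page_number=0):
    """Return the text inside `bbox` = (x0, top, x1, bottom) on one page."""
    # Third-party library; in Colab, run `!pip install pdfplumber` first.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # crop() limits the page to the box; coordinates are in PDF points,
        # measured from the top-left corner of the page.
        return page.crop(bbox).extract_text() or ""


def parse_revenue(text):
    """Turn lines like 'Alice Smith 1,250.00' into {'Alice Smith': 1250.0}."""
    revenue = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines that do not fit the assumed format are skipped
            revenue[match.group("rep")] = float(match.group("amount").replace(",", ""))
    return revenue


# Example usage (hypothetical file name and coordinates):
# text = extract_region_text("daily.pdf", bbox=(50, 200, 550, 400))
# print(parse_revenue(text))
```

Finding the right bounding box usually takes a little trial and error; printing the cropped text for one sample report is a quick way to check it.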

This functionality can save a lot of time on tasks that require manually checking a large number of PDF files, provided they share a standard layout. I can think of quite a few different ways to use this solution.

One is searching your secret family recipe database for recipes that can be prepared with a specific ingredient, say when you have an amazing bunch of berries in your fridge and need some inspiration to cook them before they expire. On a business level, another application would be aggregating details from the reports of different sales representatives to consolidate the day's total sales; processing business trip reimbursement forms; or pulling information from forms in projects that involve a lot of manual control. What other applications can you think of? Share them with me in the comments. :)
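As a rough illustration of the sales consolidation idea, the sketch below sums figures that have already been extracted from several reports. The input shape (one dictionary per report, mapping a representative's name to revenue) and the function name consolidate are assumptions made for this example only.

```python
from collections import defaultdict


def consolidate(reports):
    """Sum revenue per representative and overall across several reports.

    `reports` is a list of dicts, one per PDF, mapping name -> revenue.
    """
    per_rep = defaultdict(float)
    for report in reports:
        for rep, amount in report.items():
            per_rep[rep] += amount
    total = sum(per_rep.values())
    return dict(per_rep), total


# Example with made-up numbers standing in for values read from two PDFs:
per_rep, total = consolidate([
    {"Alice": 1250.0, "Bob": 980.5},
    {"Alice": 300.0},
])
print(per_rep)
print(total)
```

The same loop could run over every PDF in a folder, with each report's dictionary coming from whatever extraction step fits that report's layout.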
