As Python developers, we often need to extract data from a variety of sources and formats, such as PDF, CSV, HTML and more. In this article, we will take an in-depth look at parsing data from PDF files and introduce some Python packages that are useful for parsing other data formats.
Parse PDF files in Python
PDF is a standard file format used extensively for sharing and printing documents. Unfortunately, it is not the easiest format when it comes to data extraction, due to its complex structure. Fortunately, Python provides several libraries that can help us extract data from PDF files, such as PyPDF2 and PDFMiner.
PyPDF2
PyPDF2 is a pure Python library developed as a PDF toolkit. It is able to extract document information, split documents page by page, merge pages, crop pages and encrypt and decrypt PDF files.
Here is a simple example of using PyPDF2 to extract text from a PDF file:
import PyPDF2

def extract_text_from_pdf(file_path):
    # PdfReader replaces the PdfFileReader API, which was removed in PyPDF2 3.0
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
        text = ""
        for page_obj in pdf_reader.pages:
            text += page_obj.extract_text()
    return text

print(extract_text_from_pdf('example.pdf'))
# Further information is available in our PyPDF2 post.
PDFMiner
While PyPDF2 is a great tool for basic PDF processing tasks, it doesn't always do a good job when it comes to extracting text that retains its original layout. This is where PDFMiner comes into play. It focuses on retrieving and analyzing text data from PDF files.
Here is a simple example of how to use PDFMiner to extract text:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(file_path):
    text = extract_text(file_path)
    return text

print(extract_text_from_pdf('example.pdf'))
PDFQuery
PDFQuery is a lightweight Python library that uses a combination of XML and jQuery syntax to parse PDFs. It is especially useful if you know the exact location of the data in a PDF file that you want to extract.
import pdfquery

def extract_data_from_pdf(file_path):
    pdf = pdfquery.PDFQuery(file_path)
    pdf.load()
    # Select the text line containing the given label text
    label = pdf.pq('LTTextLineHorizontal:contains("Your Label")')
    return label

data = extract_data_from_pdf('example.pdf')
print(data)
Tabula-py
If your PDFs contain tables, tabula-py is the ideal library for extraction. It is a simple wrapper for Tabula (which requires a Java runtime) that reads the tables in a PDF into pandas DataFrame objects, one DataFrame per detected table.
import tabula

def extract_tables_from_pdf(file_path):
    # read_pdf returns a list of DataFrames, one per detected table
    dfs = tabula.read_pdf(file_path, pages='all')
    return dfs

dfs = extract_tables_from_pdf('example.pdf')
print(dfs)
PDFBox
PDFBox is a Java library for PDF-related tasks; the python-pdfbox package provides a Python wrapper around it. Although its functionality is somewhat limited compared to the original Java library, it can extract text, metadata and images.
import pdfbox

def extract_text_from_pdf(file_path):
    p = pdfbox.PDFBox()
    text = p.extract_text(file_path)
    return text

text = extract_text_from_pdf('example.pdf')
print(text)
Slate
Slate builds on PDFMiner to provide a simpler API for text extraction from PDF files. However, it has not been maintained for several years, so it may not work optimally with newer versions of Python.
import slate

with open('example.pdf', 'rb') as f:
    document = slate.PDF(f)

text = " ".join(document)
print(text)
PDFPlumber
This library provides extensive functionality for extracting text, tables and even visual elements from PDFs. It builds on PDFMiner and provides a more user-friendly API.
import pdfplumber

def extract_text_from_pdf(file_path):
    with pdfplumber.open(file_path) as pdf:
        first_page = pdf.pages[0]
        text = first_page.extract_text()
    return text

print(extract_text_from_pdf('example.pdf'))
Each of these libraries has its own strengths and weaknesses, and the best choice depends on the details of the task at hand. Please carefully evaluate your requirements and the PDF files you are working with when choosing a library.
Konfuzio SDK
Konfuzio is a sophisticated software development kit (SDK) that helps parse data from complex and unstructured documents, including PDFs. Konfuzio's strength lies in its ability to use machine learning for information extraction. It is not just a text extractor - it can understand the context and relationships in your document.
See Konfuzio's PDF tutorials for its Python SDK for details on getting started.
Other data parsers in Python
Beyond PDFs, Python provides a wealth of libraries for parsing various data formats. Here are a few examples.
CSV parsing: pandas
The pandas library is a powerful data manipulation tool that also simplifies parsing CSV files:
import pandas as pd

def parse_csv(file_path):
    df = pd.read_csv(file_path)
    return df

df = parse_csv('example.csv')
print(df.head())
This script reads a CSV file into a pandas DataFrame, which is a 2-dimensional labeled data structure with possibly different column types.
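Because read_csv accepts any file-like object, you can see the per-column dtype inference in action on an in-memory string; the column names and values below are invented for illustration.

```python
import io
import pandas as pd

# Parse CSV from an in-memory string; the same options apply to files on disk.
csv_data = io.StringIO(
    "name,price,qty\n"
    "widget,2.50,10\n"
    "gadget,3.75,4\n"
)

df = pd.read_csv(csv_data)
print(df.shape)           # (2, 3)
print(df['price'].dtype)  # float64 -- pandas infers a numeric dtype per column
print(df['price'].sum())  # 6.25
```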
HTML parsing: Beautiful Soup
Beautiful Soup is a Python library for web scraping that pulls data out of HTML and XML files.
from bs4 import BeautifulSoup
import requests

def parse_html(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

soup = parse_html('https://www.example.com')
print(soup.prettify())
This script fetches the HTML content of a web page and parses it into a BeautifulSoup object that you can browse to extract data.
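To see that navigation in action without a network request, you can feed Beautiful Soup an inline HTML snippet; the markup below is invented for illustration.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML document to parse, so no HTTP request is needed.
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/a">Widget</a></li>
    <li class="item"><a href="/b">Gadget</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect the text of each list item and the href of each link.
names = [li.get_text(strip=True) for li in soup.find_all('li', class_='item')]
links = [a['href'] for a in soup.find_all('a')]
print(names)  # ['Widget', 'Gadget']
print(links)  # ['/a', '/b']
```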
JSON parsing: json
Python's standard library includes the json module, which allows you to encode and parse JSON data.
import json

def parse_json(json_string):
    data = json.loads(json_string)
    return data

data = parse_json('{"key": "value"}')
print(data)
This script parses a JSON string into a Python dictionary.
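The json module also handles nested structures and reports malformed input via json.JSONDecodeError. A short sketch with invented data:

```python
import json

# A nested JSON string, e.g. from an API response (hypothetical data).
raw = '{"user": {"name": "Ada", "tags": ["admin", "dev"]}, "active": true}'

data = json.loads(raw)
print(data["user"]["name"])     # Ada
print(data["user"]["tags"][1])  # dev
print(data["active"])           # True (JSON true maps to Python True)

# Invalid JSON raises json.JSONDecodeError, a subclass of ValueError.
try:
    json.loads("{not valid}")
except json.JSONDecodeError as e:
    print("parse error at position", e.pos)
```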
Conclusion
In this article, we have only scratched the surface of data parsing in Python. Depending on your specific needs and the complexity of your data, you may need to consider other libraries and tools. However, the packages and examples provided here should give you a good starting point for most common data parsing tasks.