Dealing with unstructured data is always interesting, especially when it means building a solution to parse PDFs. Many companies and individuals use PDF files daily, and PDFs are used to distribute all kinds of information: from simple text to complex tables and diagrams.
Over the years, there have been multiple approaches to converting unstructured data from PDFs into structured data (e.g. a JSON file). With the advancement of Large Language Models (LLMs), we have an additional tool in our arsenal to assist us with this task. In fact, we could go so far as to say that LLMs have made extracting unstructured information so easy that many problems that used to plague this field (e.g. understanding complex nested tables) are now considered solved. That hasn’t stopped a sprawl of new OSS tools and libraries that try to make these tasks as easy as possible.
Problem
In this context, we had a project at hand that required us to use LLMs to extract data from PDFs. More specifically, we had to address a problem in the daily routine of tech recruiters. Their manual process involved reading PDFs and entering the information into a CRM. This took a long time, and the recruiters were wasting more time copying and pasting information than focusing on what they do best: connecting opportunities with people. Based on this problem, we created a workflow to automate the task.
The goal of this article is to describe how we solved this problem and what we learned along the way.
Solution
We need to be able to read Outlook email attachments, save them in a safe place, trigger the parser, and post the extracted information to our CRM platform. We also have to track the status of every attachment already parsed and monitor the workflow execution in production.
The diagram below describes all the technologies we used in this project and how they interact. We chose Azure as the cloud provider, with the following services:
- Blob Storage – the file storage system where we’ll store all files.
- Function App – a serverless script execution service, responsible for running the script that performs all the logical operations.
- CosmosDB – to manage the status of the attachments and save their information.
The core of the application was written in Python, communicating with OpenAI’s LLM models to parse the submitted PDFs, and connecting with the CRM platform (Byner) via the Salesforce API, which allows the creation of jobs in the tool.
On the Office 365 side, we use out-of-the-box templates from Power Automate to push email attachments to Blob Storage upon receipt. For code versioning and CI/CD, we use GitHub and GitHub Actions. And for monitoring, we connect our Python application to the Slack API, which notifies us of every execution failure.
We made some decisions in our project aimed at minimizing complexity and costs. For instance:
- The integration between Blob storage and the Function App was expedited through a default trigger definition that Azure provides for communication between both components. It’s cheaper and easier to manage than other messaging solutions on Azure, such as Event Hubs.
- CosmosDB fits very well in terms of maintainability; the database scales itself, and in our current scenario, we can even utilize the free tier.
- For the LLM, OpenAI is currently the most renowned service. Although it’s not free, we can optimize and minimize costs. For basic operations, such as checking whether a PDF file is a job description, we can use gpt-3.5 models, which are about 60 times cheaper than gpt-4 models (see the sketch after this list).
- As our main communication app is Slack, we integrated our Python script with the Slack API, sending alerts for each new issue found during execution. On the trigger side, we were also able to connect Power Automate with Slack, adding a flow that runs for every error on that side.
- GitHub and GitHub Actions cover the CI/CD process very well. We can connect our Azure Function App with our repository on GitHub and define routines for deploying and merging to the production environment.
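To make the gpt-3.5 pre-check from the list above concrete, here is a minimal sketch, assuming the OpenAI Python SDK and an OPENAI_API_KEY environment variable; the prompt wording and truncation limit are illustrative, not our production values.

import logging
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_job_description(raw_text: str) -> bool:
    """Cheap yes/no pre-check before running the expensive gpt-4 extraction."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "Answer only YES or NO. Is the following text a job description?"},
            # Truncate the input to keep the call cheap
            {"role": "user", "content": raw_text[:4000]},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

Only files that pass this check are sent to the more expensive extraction step described later in this article.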
So, let’s get hands-on
Setting up the infrastructure using Terraform
Terraform is the leading solution for Infrastructure as Code (IaC), and it has already proven its efficiency to us. For instance, this was evident when we had to switch our subscription key and rebuild the infrastructure in a new Resource Group.
We are storing all Terraform scripts inside a folder called ‘infra’, created in the root directory. Check out the structure:
infra
├── locals.tf # Define locals like project name and prefix
├── main.tf # Create RG, Storage Account and Modules
├── providers.tf # Define Azure Providers
├── terraform.tf # Add the connection with Terraform Cloud
├── variables.tf # Define all tokens
├── modules
│ ├── cosmos-db
│ │ ├── main.tf # Create CosmosDB Account and Database
│ │ ├── output.tf # Output the DB Connection String
│ │ ├── providers.tf # Define Azure Providers
│ │ └── variables.tf # Define DB vars
│ └── function-app
│ ├── logs.tf # Add Azure Application Insights and Monitor
│ ├── main.tf # Create the service plan and the Function App
│ ├── outputs.tf # Output the Azure Function App name
│ ├── providers.tf # Define Azure Providers
│ └── variables.tf # Define all vars used in our application
Here are some highlights regarding the CosmosDB configuration. You can use the code below to include it in your infrastructure. We need to create both an azurerm_cosmosdb_account and an azurerm_cosmosdb_mongo_database.
Note that, as we are using CosmosDB with the MongoDB API, we need to set kind to "MongoDB" and specify certain capabilities, such as mongoEnableDocLevelTTL, MongoDBv3.4, and EnableMongo. We also need to define a geo_location. In this case we use a single location, but if replication is required, you can add more locations.
The offer_type is set to "Standard", which fits our current scenario; from a security perspective, the Standard offer type lets us restrict access to the database to a specific range of IPs.
resource "azurerm_cosmosdb_account" "db" {
name = "your-component-name"
location = var.location
resource_group_name = var.resource_group_name
offer_type = "Standard"
kind = "MongoDB"
tags = var.tags
enable_free_tier = true
ip_range_filter = join(",", var.ips_allowed)
capabilities {
name = "EnableAggregationPipeline"
}
capabilities {
name = "mongoEnableDocLevelTTL"
}
capabilities {
name = "MongoDBv3.4"
}
capabilities {
name = "EnableMongo"
}
consistency_policy {
consistency_level = "Session"
}
geo_location {
location = var.location
failover_priority = 0
}
}
resource "azurerm_cosmosdb_mongo_database" "db_mongo" {
name = "your-db-name"
resource_group_name = azurerm_cosmosdb_account.db.resource_group_name
account_name = azurerm_cosmosdb_account.db.name
}
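On the application side, the connection string output by this module (exposed to the Function App as DB_CONNECTION_STRING) is what the Python code uses to track the status of each attachment. Here is a minimal sketch of that idea, assuming pymongo, which works against the CosmosDB Mongo API; the collection and field names are illustrative, not our production schema.

import os
from pymongo import MongoClient

# Connect to CosmosDB through its MongoDB-compatible API
client = MongoClient(os.environ["DB_CONNECTION_STRING"])
collection = client["your-db-name"]["attachments"]

def mark_status(job_hash: str, status: str) -> None:
    # Upsert one document per attachment, keyed by its hash
    collection.update_one(
        {"_id": job_hash},
        {"$set": {"status": status}},
        upsert=True,
    )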
Another important component is the Function App. The following script creates the infrastructure for the Function App in your Resource Group on Azure. It involves defining the azurerm_service_plan and azurerm_linux_function_app, along with the necessary configurations. In our scenario, we want to run the application on a Linux machine and use Python as the programming language.
resource "azurerm_service_plan" "jda_service_plan" {
name = "${var.azure_function_name}-app-service-plan"
resource_group_name = var.resource_group_name
location = var.location
os_type = "Linux"
sku_name = "Y1"
tags = var.tags
}
resource "azurerm_linux_function_app" "job_description_function_app" {
name = "${var.azure_function_name}-func-app"
resource_group_name = var.resource_group_name
location = var.location
tags = var.tags
storage_account_name = var.storage_account_name
storage_account_access_key = var.storage_account_primary_access_key
https_only = true
service_plan_id = azurerm_service_plan.jda_service_plan.id
app_settings = {
"FUNCTIONS_WORKER_RUNTIME" = "python"
"FUNCTIONS_EXTENSION_VERSION" = "~4"
"OPENAI_API_KEY" = var.open_ai_key
"GPT4_MODEL_NAME" = var.open_ai_model
"OPEN_AI_TEMPERATURE" = var.open_ai_temperature
"CRM_USERNAME" = var.crm_username
"CRM_PASSWORD" = var.crm_password
"CRM_SECURITY_TOKEN" = var.crm_security_token
"CRM_CLIENT_ID" = var.crm_client_id
"CRM_CLIENT_SECRET" = var.crm_client_secret
"DB_CONNECTION_STRING" = var.db_connection_secret
"SLACK_TOKEN" = var.slack_token
"SLACK_CHANNEL" = var.slack_channel
}
site_config {
# Define Python Version
application_stack {
python_version = 3.11
}
# CORS is required to test the Azure Function via Azure Portal
cors {
allowed_origins = ["https://portal.azure.com"]
support_credentials = true
}
}
identity {
type = "SystemAssigned"
}
}
For more details, check out the Terraform provider reference pages, where you can find all the definitions and further examples for CosmosDB and the Function App.
Creating the Azure Function with Python
Azure Functions offers a serverless architecture that simplifies your development process by reducing the amount of code you need to write, minimizing infrastructure management, and cutting down costs. You can write and host applications in different programming languages, but for this solution we chose Python.
All we need to define to build a Function App is a main Python file with the trigger definitions, plus a requirements.txt to install the platform dependencies. The next file has a blob_trigger definition that specifies the path we’ll watch: by default, each new file created under this path triggers the function.
import logging
import azure.functions as func
from io import BytesIO

# This is the main function of our project
from application.main import run_workflow

app = func.FunctionApp()

@app.blob_trigger(arg_name="myblob",
                  path="dbc",
                  connection="AzureWebJobsStorage")
def BlobTriggerTest(myblob: func.InputStream):
    # Read the file into memory
    blob_io = BytesIO(myblob.read())
    logging.info(f"Processing blob: name={myblob.name}, size={myblob.length} bytes")

    # Call our main function to run the workflow
    run_workflow(blob_io, myblob.uri, "dbc")
As you can see above, we can also call functions from other files, such as the run_workflow function imported from application.main. Once you have this function defined, you can commit it to your repository and connect it to your cloud function. Please refer to the official documentation for more information.
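As for the requirements.txt mentioned earlier, a minimal sketch based on the libraries used throughout this article could look like this (pin the versions that match your environment):

# Indicative dependency list, not the exact production file
azure-functions
pdfplumber
openai
instructor
pydantic
slack_sdk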
Parser Operation
The core of our system is the PDF parser. The concept is to extract the job description content and return it in a structured format to be used later in our CRM application. The approach we chose was pdfplumber to extract the raw content of the PDF and GPT-4 to comprehend that raw content and convert it into a structured format. With the help of the instructor library, we can direct the LLM to return the response according to a defined schema.
Firstly, let’s define a job description schema, and then we can instruct the LLM model to return a response in this format:
from pydantic import BaseModel

class JobDetail(BaseModel):
    job_name: str
    ref_number: str
    contact_first_name: str
    contact_last_name: str
    department: str
    company: str
    location: str
    account: str
    hours_per_week: int
    deadline_date: str
    deadline_time: str
    quote_request_date: str
    request_date: str
    duration_months: int
    desired_start_date: str | None
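As a quick, hypothetical illustration of what this schema buys us, pydantic validates and coerces the fields at parse time (the values below are taken from the example JSON later in this article):

# Hypothetical standalone example; pydantic coerces "36" into an int
# and accepts None for the optional desired_start_date field.
detail = JobDetail(
    job_name="Data Engineer", ref_number="4445324",
    contact_first_name="Samuel", contact_last_name="Favarin",
    department="Engineering", company="Data Build Company",
    location="Utrecht", account="Test", hours_per_week="36",
    deadline_date="2024-12-15", deadline_time="23:59",
    quote_request_date="2024-12-11", request_date="2024-12-11",
    duration_months=12, desired_start_date=None,
)
assert detail.hours_per_week == 36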
Even though we are utilising the instructor library to manage the LLM response, it is essential to craft a well-structured prompt. A common best practice is to offer clear instructions and provide examples of the expected output based on the previously defined schema. Here is an example:
system_prompt_job_details = """
You are a best-in-class system that parses Dutch job descriptions.
You are factual.
You never hallucinate.
You always ONLY return JSON.
Your objective is to take as input the following text from
a parsed .pdf file and extract the required text following these
definitions.
Extract the required text from the section that describes the:
'Inhurend manager', 'Indienen offertes', 'Naam hoofdstandplaats',
'Datum offerte aanvraag','Functienaam', 'Uren per week',
'Soort aanvrag', 'Aantal maanden initiële inhuurtermijn',
'Referentienummer', 'Departement', 'Afdeling/Bedrijfsonderdeel',
'Deelnemende dienst', 'Organisatie', 'Eventueel Maximum Uurtarief'
OR 'Ongewijzigde herhalingsaanvraag'.
- When you occur into dates e.g. maandag 11 december 2023 convert it
into 2023-12-11.
- When you occur into location, please, only return the city name
and/or the state name. e.g. 'Utrecht (NVWA)', return only
'Utrecht'.
- If you don't find the deadline time, please, return 23:59 as
default.
- The `contact_first_name` and `contact_last_name` must to be filled
using the 'Manager' value.
Do not translate the output response text to English.
Provide the returned results in a JSON file.
If you don't have enough information you provide a empty JSON
or empty field.
As an example, when you get a parsed text as the following:
{system_prompt_document_example}
the expected JSON response should be:
```
"details": {
"job_name": "Data Engineer"
"ref_number": "4445324"
"contact_first_name": "Samuel"
"contact_last_name": "Favarin"
"department": "Engineering"
"company": "Data Build Company"
"location": "Utrecht"
"account": "Test"
"hours_per_week": 36
"deadline_date": "2024-12-15"
"deadline_time": "23:59"
"quote_request_date": "2024-12-11"
"request_date": "2024-12-11"
"duration_months": 12
"desired_start_date": "2024-12-20"
}
```
"""
Also, for the raw content extraction, the pdfplumber library can be used like this:
import logging
import pdfplumber
from io import BytesIO

def parse_pdf(pdf_file: BytesIO) -> str:
    """This function extracts the raw content of the .pdf file."""
    try:
        with pdfplumber.open(pdf_file) as pdf:
            pdf_pages_list = []
            for page in pdf.pages:
                # extract_text() may return None for empty pages
                text = page.extract_text() or ""
                pdf_pages_list.append(text)
        return "".join([page_text.replace('*', '') for page_text in pdf_pages_list])
    except Exception as e:
        logging.error(f"Error parsing pdf. Error={e}")
        return ""
Once we have extracted the raw data, we can call the OpenAI API and retrieve the structured response.
import logging
import instructor
from openai import OpenAI
from pydantic import BaseModel

class LLMClient:
    def __init__(self):
        # instructor patches the OpenAI client so it accepts the
        # `response_model` and `max_retries` arguments
        self.client = instructor.patch(OpenAI())

    def run(self, system_prompt: str, file_content: str, response_model: type[BaseModel]):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": file_content}
        ]
        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
            response_model=response_model,  # e.g. the JobDetail schema defined previously
            response_format={"type": "json_object"},
            seed=12345,
            temperature=0.0,
            max_tokens=4096,
            max_retries=4
        )
        logging.info(f"Received response from OpenAI: {response}")
        if not response:
            raise ValueError("No content received from OpenAI response")
        return response
With this method, our system can accurately extract job description information and return the content in a structured manner. For further details, refer to the documentation for a better understanding of each OpenAI parameter and of how the instructor library ensures that the response aligns with the defined schema.
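To illustrate how these pieces fit together, here is a hypothetical end-to-end call, assuming the parse_pdf function and the LLMClient class sketched above (the file name is illustrative, and this is not the production run_workflow):

from io import BytesIO

# Hypothetical glue code: parse the PDF, then ask the LLM
# for a structured JobDetail.
with open("job_description.pdf", "rb") as f:
    raw_text = parse_pdf(BytesIO(f.read()))

llm = LLMClient()
job = llm.run(system_prompt_job_details, raw_text, JobDetail)
print(job.model_dump_json(indent=2))  # pydantic v2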
Adding Monitoring Alerts with Slack
Once the workflow is deployed and running, it’s important that we’re alerted if something goes wrong. To cover this scenario, we created an integration with our Slack account: for each new issue during workflow execution, we receive a message in our Slack channel. Slack provides a library called slack_sdk, so you can set up your connection by providing the API token (requested via the Slack website) and the ID of the channel you want to send the message to. The following code shows the implementation of this integration.
from slack_sdk import WebClient

class SlackClient:
    def __init__(self, token: str, channel: str):
        self.channel = channel
        self.client = WebClient(token=token)

    def alert(self, data: dict, alert_icon: str = ":alert:"):
        blocks: list = [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"{alert_icon} *We've encountered an error while creating the Job Description.* {alert_icon}"
                }
            },
            {
                "type": "section",
                "fields": [
                    {
                        "type": "mrkdwn",
                        "text": f"*{data['job_description']}*"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Job Hash:* {data['job_hash']}"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Path:* {data['data_path']}"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Datetime:* {data['datetime']}"
                    },
                    {
                        "type": "mrkdwn",
                        "text": f"*Message:* {data['message']}"
                    }
                ]
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    # Link to the logs; the caller provides `log_url`
                    "text": f"<{data['log_url']}|Check the logs>"
                },
            }
        ]
        return self.client.chat_postMessage(
            channel=self.channel,
            blocks=blocks)
And you can call it like this:
import logging
import datetime

# Assumption: SlackClient is the class defined above, and `settings`
# holds the app configuration (Slack token and channel); adjust the
# import paths to your project layout.
from application.slack_client import SlackClient
from application.settings import settings

def send_slack_alert(job_name: str, pdf_path: str, job_hash: str, message: str, alert_icon=":alert:") -> bool:
    try:
        slack_client = SlackClient(token=settings.slack_token,
                                   channel=settings.slack_channel)
        slack_client.alert(
            data={
                "job_description": job_name,
                "job_hash": job_hash,
                "data_path": pdf_path,
                "datetime": datetime.datetime.now(),
                "message": message,
                "log_url": "https://portal.azure.com/"
            },
            alert_icon=alert_icon
        )
        return True
    except Exception as e:
        logging.error(f"Error! pdf_path={pdf_path}, error={e}")
        return False
As a result, you’ll receive an alert in your Slack channel.
Conclusion
The full solution is currently operational in production and is saving tech recruiters valuable time. The LLM models have proven to parse information very effectively. It’s also important to mention that this solution does not replace the recruiters; rather, it enhances the manual process, freeing up capacity for value-add activities.
Regarding costs, even at a small scale the infrastructure is affordable. The Azure services cost approximately $4.00 per month, and the OpenAI services (billed by consumption) come to around $50.00 per month, depending on the number of job descriptions parsed.