Beyond OCR: Benchmarking Question-Answering on Complex Industrial PDFs with TIA-pdf-QA-Bench

ThirdAI Automation Team
Document QA · Industrial Automation · OCR Benchmarking · RAG Systems · PDF Processing · ThirdAI

Introduction

A growing number of benchmark datasets have emerged for evaluating document understanding, particularly in areas like Optical Character Recognition (OCR), information extraction, and question answering (QA). However, many existing benchmarks rely on clean document formats or fail to evaluate end-to-end QA pipeline quality. This makes them insufficient for assessing real-world industrial documents where noise, formatting variability, and complex semantics are the norm.

At ThirdAI Automation, we address a fundamental question: How well can we answer questions based on complex, semi-structured PDF documents from industrial domains? To answer it, we created TIA-pdf-QA-Bench, a new benchmark that evaluates end-to-end QA performance over PDFs, with an emphasis on retrieval-augmented generation (RAG) pipelines.

Why Traditional OCR Benchmarks Fall Short

OCR performance is traditionally evaluated in isolation using word/character accuracy metrics. While useful for assessing text extraction fidelity, this approach misses a crucial downstream impact: How do OCR mistakes affect real use cases like question answering?

For example:

  • A single misrecognized term in a spec sheet might negligibly impact OCR scores
  • The same error could derail a QA system extracting critical parameters
  • We don't just care whether the text is readable; we care whether it's useful for the task (a toy example below makes this concrete)
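
As a minimal sketch of this gap (the spec-sheet line and values are hypothetical), consider how a single misread digit barely moves character error rate yet flips an exact-match QA score:

```python
# Minimal illustration: one OCR error barely moves character accuracy,
# but breaks exact-match QA on the misread value.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

ground_truth = "Rated torque: 4.8 Nm at 3000 rpm, insulation class F"
ocr_output   = "Rated torque: 4.6 Nm at 3000 rpm, insulation class F"  # 8 misread as 6

cer = levenshtein(ground_truth, ocr_output) / len(ground_truth)
print(f"Character error rate: {cer:.1%}")  # ~1.9% -- looks excellent

# A QA system asked "What is the rated torque?" now extracts the wrong value:
predicted, expected = "4.6 Nm", "4.8 Nm"
print("Exact-match QA:", predicted == expected)  # False -- the answer is wrong
```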

The Real Challenge: Retrieval and Understanding

OCR is just the beginning. Extracted text must be chunked, linked, and indexed to enable effective retrieval and reasoning. Industrial documents present unique challenges:

  • Long documents with heterogeneous formatting
  • Tables, figures, and side-by-side layouts
  • Implicit references and domain-specific terminology
  • Dense hierarchical structures (specifications, standards)

In building TIA-pdf-QA-Bench, we found that text chunking and representation structure profoundly impact QA performance (see the sketch after this list). Poor chunking leads to:

  • Missed answers
  • Irrelevant retrievals
  • Hallucinations in generative models
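
To make the contrast concrete, here is a minimal sketch of the two ends of the chunking spectrum: fixed-length windows versus a crude structure-aware split. The heading regex and the sample text are illustrative assumptions, not our production chunker:

```python
import re

def fixed_length_chunks(text: str, size: int = 500, overlap: int = 100):
    """Sliding character windows: simple, but blind to document structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def heading_chunks(text: str):
    """Crude structure-aware split: start a new chunk at numbered headings
    like '2.1 Torque' (pattern is an assumption -- real industrial PDFs
    need layout-aware heading detection)."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+[A-Z])", text)  # Python 3.7+
    return [p.strip() for p in parts if p.strip()]

doc = """1 Scope
This standard covers three-phase induction motors.
2 Ratings
2.1 Torque
Rated torque is 4.8 Nm at 3000 rpm.
"""
print(len(fixed_length_chunks(doc)), "fixed chunk(s)")
print(heading_chunks(doc))  # one chunk per (sub)section
```

The structure-aware split keeps each specification with its heading, so a retriever sees "2.1 Torque" and its value together; the fixed window can slice them apart.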

About TIA-pdf-QA-Bench

TIA-pdf-QA-Bench evaluates QA quality on real-world industrial documents with these features:

  • Uses authentic PDFs from industrial partners and public sources
  • Simulates realistic QA scenarios (domain terminology, multi-hop reasoning)
  • Evaluates the end-to-end pipeline (OCR → preprocessing → retrieval → answer generation); a skeleton of such a pipeline is sketched below
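
To make "end-to-end" concrete, here is a hypothetical pipeline skeleton; the stage callables and the exact_match scorer are illustrative placeholders, not our actual implementation. The point is that only the final answer is scored, so an error at any stage counts against the whole system:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    """Each stage is a pluggable callable, so OCR engines, chunkers, and
    retrievers can be swapped and compared on final answer quality."""
    ocr: Callable[[bytes], str]                       # PDF bytes -> raw text
    preprocess: Callable[[str], List[str]]            # raw text -> chunks
    retrieve: Callable[[str, List[str]], List[str]]   # (query, chunks) -> top-k
    generate: Callable[[str, List[str]], str]         # (query, context) -> answer

    def answer(self, pdf_bytes: bytes, question: str) -> str:
        text = self.ocr(pdf_bytes)
        chunks = self.preprocess(text)
        context = self.retrieve(question, chunks)
        return self.generate(question, context)

def exact_match(pipeline: RAGPipeline, dataset) -> float:
    """End-to-end scoring: only the final answer is compared to gold."""
    hits = sum(pipeline.answer(d["pdf"], d["question"]) == d["answer"]
               for d in dataset)
    return hits / len(dataset)
```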

We tested multiple RAG pipelines using:

  • OCR tools: Tesseract, Azure OCR
  • Chunking strategies: Fixed-length vs. semantic
  • Retrieval methods: Dense vs. sparse (contrasted in the sketch below)
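
A minimal sketch of the dense-versus-sparse contrast, using scikit-learn TF-IDF for the sparse side and a sentence-transformers model for the dense side. The library and model choices here are our illustrative assumptions, not necessarily what the benchmark shipped with:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer  # assumes package is installed

chunks = [
    "Rated torque is 4.8 Nm at 3000 rpm.",
    "Insulation class F per IEC 60034-1.",
    "Mounting: flange type B5, shaft diameter 14 mm.",
]
query = "What is the motor's rated torque?"

# Sparse: TF-IDF + cosine similarity (rows are L2-normalized by default,
# so the dot product is cosine similarity). Rewards exact term overlap.
tfidf = TfidfVectorizer().fit(chunks)
sparse_scores = (tfidf.transform([query]) @ tfidf.transform(chunks).T).toarray()[0]

# Dense: sentence embeddings (semantic matching; model choice is illustrative).
model = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, c_emb = model.encode([query]), model.encode(chunks)
dense_scores = (q_emb @ c_emb.T)[0] / (
    np.linalg.norm(q_emb) * np.linalg.norm(c_emb, axis=1))

for name, scores in [("sparse", sparse_scores), ("dense", dense_scores)]:
    print(name, "->", chunks[int(np.argmax(scores))])
```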

Key Insights

ThirdAI Automation's RAG framework achieved the highest QA accuracy in our benchmark:

[Figure: QA accuracy comparison of different RAG frameworks. ThirdAI's RAG framework outperformed alternatives in industrial document QA.]

More benchmark details will be released in an upcoming research paper!

What's Next?

TIA-pdf-QA-Bench advances realistic, task-oriented evaluation of document intelligence systems. We're expanding the benchmark with:

  • More document types
  • Richer annotations
  • Harder questions to identify failure cases

Example document from our benchmark:

[Figure: Mechanical drawing sample from the benchmark documents. Complex industrial documents like this require advanced understanding capabilities.]

Grow With Us

Working on industrial document QA or building PDF reasoning systems? Reach out to our team.

We're releasing an API for testing our OCR, chunking, and RAG functionality! Join the waitlist for early access.

Get early access to TIA-pdf-QA-Bench