Beyond OCR: Benchmarking Question-Answering on Complex Industrial PDFs with TIA-pdf-QA-Bench

ThirdAI Automation Team
Document QA · Industrial Automation · OCR Benchmarking · RAG Systems · PDF Processing · ThirdAI

Introduction

A growing number of benchmark datasets have emerged for evaluating document understanding, particularly in areas like Optical Character Recognition (OCR), information extraction, and question answering (QA). However, many existing benchmarks rely on clean document formats or fail to evaluate end-to-end QA pipeline quality. This makes them insufficient for assessing real-world industrial documents where noise, formatting variability, and complex semantics are the norm.

At ThirdAI Automation, we address a fundamental question: How well can we answer questions based on complex, semi-structured PDF documents from industrial domains? To answer it, we created TIA-pdf-QA-Bench, a new benchmark that evaluates end-to-end QA performance over PDFs, with an emphasis on retrieval-augmented generation (RAG) pipelines.

Why Traditional OCR Benchmarks Fall Short

OCR performance is traditionally evaluated in isolation using word/character accuracy metrics. While useful for assessing text extraction fidelity, this approach misses a crucial downstream impact: How do OCR mistakes affect real use cases like question answering?

For example:

  • A single misrecognized term in a spec sheet might negligibly impact OCR scores
  • The same error could derail a QA system extracting critical parameters
  • We don't just care whether the text is readable; we care whether it's useful for the task (a toy example below makes this concrete)
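
As a minimal sketch of this gap (the spec-sheet line and values are hypothetical), consider how a single misread digit barely moves character error rate yet flips an exact-match QA score:

```python
# Minimal illustration: one OCR error barely moves character accuracy,
# but breaks exact-match QA on the misread value.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

ground_truth = "Rated torque: 4.8 Nm at 3000 rpm, insulation class F"
ocr_output   = "Rated torque: 4.6 Nm at 3000 rpm, insulation class F"  # 8 misread as 6

cer = levenshtein(ground_truth, ocr_output) / len(ground_truth)
print(f"Character error rate: {cer:.1%}")  # ~1.9% -- looks excellent

# A QA system asked "What is the rated torque?" now extracts the wrong value:
predicted, expected = "4.6 Nm", "4.8 Nm"
print("Exact-match QA:", predicted == expected)  # False -- the answer is wrong
```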

The Real Challenge: Retrieval and Understanding

OCR is just the beginning. Extracted text must be chunked, linked, and indexed to enable effective retrieval and reasoning. Industrial documents present unique challenges:

  • Long documents with heterogeneous formatting
  • Tables, figures, and side-by-side layouts
  • Implicit references and domain-specific terminology
  • Dense hierarchical structures (specifications, standards)

In building TIA-pdf-QA-Bench, we found that text chunking and representation structure profoundly impact QA performance (see the sketch after this list). Poor chunking leads to:

  • Missed answers
  • Irrelevant retrievals
  • Hallucinations in generative models
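
To make the contrast concrete, here is a minimal sketch of the two ends of the chunking spectrum: fixed-length windows versus a crude structure-aware split. The heading regex and the sample text are illustrative assumptions, not our production chunker:

```python
import re

def fixed_length_chunks(text: str, size: int = 500, overlap: int = 100):
    """Sliding character windows: simple, but blind to document structure."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def heading_chunks(text: str):
    """Crude structure-aware split: start a new chunk at numbered headings
    like '2.1 Torque' (pattern is an assumption -- real industrial PDFs
    need layout-aware heading detection)."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+[A-Z])", text)  # Python 3.7+
    return [p.strip() for p in parts if p.strip()]

doc = """1 Scope
This standard covers three-phase induction motors.
2 Ratings
2.1 Torque
Rated torque is 4.8 Nm at 3000 rpm.
"""
print(len(fixed_length_chunks(doc)), "fixed chunk(s)")
print(heading_chunks(doc))  # one chunk per (sub)section
```

The structure-aware split keeps each specification with its heading, so a retriever sees "2.1 Torque" and its value together; the fixed window can slice them apart.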

About TIA-pdf-QA-Bench

TIA-pdf-QA-Bench evaluates QA quality on real-world industrial documents with these features:

  • Uses authentic PDFs from industrial partners and public sources
  • Simulates realistic QA scenarios (domain terminology, multi-hop reasoning)
  • Evaluates the end-to-end pipeline (OCR → preprocessing → retrieval → answer generation); a skeleton of such a pipeline is sketched below
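
To make "end-to-end" concrete, here is a hypothetical pipeline skeleton; the stage callables and the exact_match scorer are illustrative placeholders, not our actual implementation. The point is that only the final answer is scored, so an error at any stage counts against the whole system:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RAGPipeline:
    """Each stage is a pluggable callable, so OCR engines, chunkers, and
    retrievers can be swapped and compared on final answer quality."""
    ocr: Callable[[bytes], str]                       # PDF bytes -> raw text
    preprocess: Callable[[str], List[str]]            # raw text -> chunks
    retrieve: Callable[[str, List[str]], List[str]]   # (query, chunks) -> top-k
    generate: Callable[[str, List[str]], str]         # (query, context) -> answer

    def answer(self, pdf_bytes: bytes, question: str) -> str:
        text = self.ocr(pdf_bytes)
        chunks = self.preprocess(text)
        context = self.retrieve(question, chunks)
        return self.generate(question, context)

def exact_match(pipeline: RAGPipeline, dataset) -> float:
    """End-to-end scoring: only the final answer is compared to gold."""
    hits = sum(pipeline.answer(d["pdf"], d["question"]) == d["answer"]
               for d in dataset)
    return hits / len(dataset)
```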

We tested multiple RAG pipelines using:

  • OCR tools: Tesseract, Azure OCR
  • Chunking strategies: Fixed-length vs. semantic
  • Retrieval methods: Dense vs. sparse (contrasted in the sketch below)
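
A minimal sketch of the dense-versus-sparse contrast, using scikit-learn TF-IDF for the sparse side and a sentence-transformers model for the dense side. The library and model choices here are our illustrative assumptions, not necessarily what the benchmark shipped with:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer  # assumes package is installed

chunks = [
    "Rated torque is 4.8 Nm at 3000 rpm.",
    "Insulation class F per IEC 60034-1.",
    "Mounting: flange type B5, shaft diameter 14 mm.",
]
query = "What is the motor's rated torque?"

# Sparse: TF-IDF + cosine similarity (rows are L2-normalized by default,
# so the dot product is cosine similarity). Rewards exact term overlap.
tfidf = TfidfVectorizer().fit(chunks)
sparse_scores = (tfidf.transform([query]) @ tfidf.transform(chunks).T).toarray()[0]

# Dense: sentence embeddings (semantic matching; model choice is illustrative).
model = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, c_emb = model.encode([query]), model.encode(chunks)
dense_scores = (q_emb @ c_emb.T)[0] / (
    np.linalg.norm(q_emb) * np.linalg.norm(c_emb, axis=1))

for name, scores in [("sparse", sparse_scores), ("dense", dense_scores)]:
    print(name, "->", chunks[int(np.argmax(scores))])
```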

Key Insights

ThirdAI Automation's RAG framework achieved the highest QA accuracy in our benchmark:

[Figure: QA accuracy comparison of different RAG frameworks. ThirdAI's RAG framework outperformed alternatives in industrial document QA.]

More benchmark details will be released in an upcoming research paper!

What's Next?

TIA-pdf-QA-Bench advances realistic, task-oriented evaluation of document intelligence systems. We're expanding the benchmark with:

  • More document types
  • Richer annotations
  • Harder questions to identify failure cases

Example document from our benchmark:

[Figure: Mechanical drawing sample from the benchmark documents. Complex industrial documents like this require advanced understanding capabilities.]

Grow With Us

Working on industrial document QA or building PDF reasoning systems? Reach out to our team.

We're releasing an API for testing our OCR, chunking, and RAG functionality! Join the waitlist for early access.

Get early access to TIA-pdf-QA-Bench