Two Paths to Document Extraction: Comparing Rule-Based OCR and LLM Approaches for B2B Orders

Introduction

In the world of B2B operations, extracting structured data from PDF orders is a common yet challenging task. Traditional methods rely on rule-based systems and optical character recognition (OCR), while newer approaches leverage large language models (LLMs) to understand and parse document content. This article compares two implementations of a document extractor for B2B orders: a classic rule-based pipeline using pytesseract and a modern LLM-based solution using Ollama with LLaMA 3. Both were built on the same realistic order dataset to provide a fair, practical comparison.

Source: towardsdatascience.com

Understanding B2B Document Extraction

B2B documents, such as purchase orders, invoices, and shipping bills, contain critical information like line items, quantities, prices, and vendor details. Extracting these fields accurately is essential for automation, inventory management, and accounting. The complexity arises from varied layouts, fonts, and occasional mixed text and tables.

The Two Approaches at a Glance

  • Rule-Based (pytesseract): Uses OCR to convert PDF images to text, then applies regular expressions and custom logic to locate and extract specific fields.
  • LLM-Based (Ollama + LLaMA 3): Feeds the full or preprocessed document text to a local LLM, prompting it to identify and output the desired fields in a structured format.

The Rule-Based Approach with pytesseract

The first extractor was built using the classic OCR pipeline. After converting PDF pages to images, pytesseract (a Python wrapper for Tesseract OCR) was used to extract raw text. Then, a series of handcrafted rules were applied:

  1. Standardize the text by removing noise (e.g., stray characters, page numbers).
  2. Use regular expressions to match patterns for order numbers, dates, line items (quantity, SKU, price), and totals.
  3. Apply heuristics for common layouts (e.g., assuming the vendor address is in the top left corner).
  4. Validate extracted values (e.g., checking date formats, price range).
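Steps 2 and 4 above can be sketched in a few lines of Python. The field names and regular expressions below are illustrative stand-ins, not the article's exact rules, but they show the general shape of a handcrafted extraction pass over OCR text:

```python
import re

# Illustrative patterns; a real pipeline would have one set per template.
ORDER_ID_RE = re.compile(r"Order\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b")
# Line item row: quantity, SKU, unit price, e.g. "3  SKU-1042  $19.99"
LINE_ITEM_RE = re.compile(r"^\s*(\d+)\s+([A-Z]{2,}-?\d+)\s+\$?(\d+\.\d{2})\s*$", re.MULTILINE)

def extract_fields(ocr_text: str) -> dict:
    """Apply regex rules to OCR output and return a structured record."""
    order_id = ORDER_ID_RE.search(ocr_text)
    date = DATE_RE.search(ocr_text)
    items = [
        {"qty": int(q), "sku": sku, "price": float(p)}
        for q, sku, p in LINE_ITEM_RE.findall(ocr_text)
    ]
    return {
        "order_id": order_id.group(1) if order_id else None,
        "date": date.group(1) if date else None,
        "line_items": items,
    }

sample = "Order No: PO-7731\nDate: 2024-03-15\n3  SKU-1042  $19.99\n"
print(extract_fields(sample))
```

The brittleness discussed below follows directly from this style: every pattern encodes an assumption about layout, and a new template means a new set of patterns.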

The rule-based approach performed well on documents that matched the expected templates. However, even slight variations—such as a new font or a reorganized table—required manual rule updates. Maintenance became a growing overhead as the number of document types increased.

The LLM-Based Approach with Ollama and LLaMA 3

The second extractor replaced explicit rules with a local LLM. Using Ollama to serve LLaMA 3 (8B parameters), the system took the raw OCR output (or even the raw image when using vision models) and prompted the model to extract the relevant fields.

  1. Read the PDF and convert to text (using pytesseract for a fair comparison) or directly pass page images to a multimodal LLM.
  2. Construct a prompt that describes the expected output schema (e.g., JSON with fields: order_id, date, vendor, line_items, total).
  3. Send the document content along with the prompt to the LLM and parse the returned JSON.
  4. Optionally, use a second pass to correct common errors (e.g., hallucinated line items) by re-prompting with a confidence check.
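Steps 2 and 3 can be sketched against Ollama's local HTTP API (assumed at its default address, `http://localhost:11434`). The schema and prompt wording here are illustrative, not the article's exact prompt:

```python
import json
import urllib.request

# Hypothetical output schema describing the fields we want back.
SCHEMA = {
    "order_id": "string", "date": "YYYY-MM-DD", "vendor": "string",
    "line_items": [{"qty": "int", "sku": "string", "price": "float"}],
    "total": "float",
}

def build_prompt(document_text: str) -> str:
    """Describe the expected schema, then append the document text."""
    return (
        "Extract the following fields from this B2B order. Reply with JSON "
        f"only, matching this schema: {json.dumps(SCHEMA)}\n\n"
        f"Document:\n{document_text}"
    )

def extract_with_llm(document_text: str) -> dict:
    """Send the prompt to a local Ollama server and parse the JSON reply."""
    payload = json.dumps({
        "model": "llama3",
        "prompt": build_prompt(document_text),
        "format": "json",  # Ollama's JSON mode constrains the output
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return json.loads(body["response"])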

The LLM-based approach showed impressive flexibility. It handled unseen layouts without rule changes, correctly interpreting tables, mixed text, and even handwriting in some cases. However, it required careful prompt engineering and occasionally returned hallucinated data.


Head-to-Head Comparison

Accuracy

On the test set of 200 real B2B orders, the rule-based extractor achieved 92% field-level accuracy for standard documents. The LLM-based approach scored 94%, but its errors were often less predictable (e.g., inventing line items). For documents outside the known templates, rule-based accuracy dropped to 65% while the LLM maintained 85%.

Flexibility

  • Rules: Low flexibility. Every new layout requires manual regex crafting.
  • LLM: High flexibility. One prompt works across many formats, though edge cases (e.g., tables with merged cells) can confuse the model.

Cost and Speed

Rule-based extraction is very cheap (only CPU usage) and fast—0.5 seconds per page. The LLM-based approach requires a GPU for acceptable speed, taking 3–5 seconds per page on a consumer GPU, and incurs higher energy costs. For a high‑volume pipeline (thousands of documents daily), the rule-based method wins on cost.

Maintenance

  • Rules: High maintenance. Each template addition means new code.
  • LLM: Low maintenance. Occasional prompt tweaks, and the model can be upgraded by swapping in a newer checkpoint.

When to Use Each Approach

Choose a rule-based extractor when:

  • You have a small, fixed set of document templates.
  • Speed and cost are paramount.
  • You need deterministic, auditable extraction.

Choose an LLM-based extractor when:

  • You handle many varied document layouts.
  • You can tolerate minor hallucinations with manual review.
  • You have access to local or cloud GPU resources.

Conclusion

Both methods have strengths. The rule-based approach using pytesseract is fast, cheap, and reliable for stable document sets. The LLM-based approach with Ollama and LLaMA 3 offers unmatched flexibility and adapts to new layouts without code changes. In practice, a hybrid system—rules for standard documents, LLM fallback for exceptions—often yields the best results. Whichever path you choose, understanding these trade-offs is key to building a robust B2B document extraction pipeline.
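The hybrid system suggested above can be sketched as a simple router: run the cheap rule-based extractor first and fall back to the LLM only when the result looks incomplete. The required-field check and function names here are hypothetical, assuming both extractors return dictionaries like those shown earlier:

```python
# Fields that must be present for a rule-based result to be accepted.
REQUIRED_FIELDS = ("order_id", "date", "line_items", "total")

def extract(document_text, rule_extractor, llm_extractor):
    """Try the rules first; fall back to the LLM if any field is missing."""
    result = rule_extractor(document_text)
    if any(not result.get(field) for field in REQUIRED_FIELDS):
        result = llm_extractor(document_text)
        result["extractor"] = "llm"   # record which path produced the output
    else:
        result["extractor"] = "rules"
    return result
```

This keeps the high-volume standard documents on the fast, cheap path while reserving GPU time for the exceptions, and the recorded `extractor` field makes it easy to audit which path handled each order.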
