Document intake · 6 min

Read extraction states and scanned PDFs

How queued, processing, succeeded, failed, and scanned-PDF no-text extraction states work, including OCR and manual transcription fallbacks.

Help baseline: 2026-06-15

extraction status · scanned PDF · OCR fallback · human review · replacement guidance · manual transcription

Follow the extraction state

Extraction status is always one of pending, queued, processing, succeeded, failed, or scanned_pdf_no_text. Source-register rows also show review handoff labels: awaiting_upload, scan_clean, extraction_queued, extraction_processing, extraction_failed, scanned_pdf_no_text, extracted_unlinked, review_ready, review_approved, and review_rejected.

  • queued means the source cleared scanning and is waiting for the extraction worker.
  • processing means the worker is reading the stored file into reviewable evidence.
  • succeeded means text was extracted, but the linked evidence still needs the appropriate human review state before board-pack reliance.
  • scanned_pdf_no_text means the PDF appears image-only and accepted OCR did not produce reviewable text, so a text-bearing replacement or manual transcription is needed before review.
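
As a rough illustration of the states above, the sketch below models the six status values as a Python enum. The class and helper names (ExtractionStatus, needs_user_action, can_enter_review_queue) are hypothetical and not part of any DefenceFile API; only the status strings come from this article.

```python
from enum import Enum


class ExtractionStatus(str, Enum):
    """Extraction status values as listed in this article."""
    PENDING = "pending"
    QUEUED = "queued"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SCANNED_PDF_NO_TEXT = "scanned_pdf_no_text"


# Statuses where the row stays blocked before review until someone acts.
BLOCKED_BEFORE_REVIEW = frozenset(
    {ExtractionStatus.FAILED, ExtractionStatus.SCANNED_PDF_NO_TEXT}
)


def needs_user_action(status: ExtractionStatus) -> bool:
    """True when a replacement, re-run, or manual transcription is needed."""
    return status in BLOCKED_BEFORE_REVIEW


def can_enter_review_queue(status: ExtractionStatus) -> bool:
    """True only when text was extracted; human review is still required."""
    return status is ExtractionStatus.SUCCEEDED
```

Note that can_enter_review_queue returning True says nothing about approval: succeeded evidence still needs the appropriate human review state before board-pack reliance.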

Handle scanned PDFs honestly

Current extraction can read text-bearing PDFs, DOCX, XLSX, CSV, plain text, and accepted scanned-PDF OCR output in the document-worker path. If OCR is disabled, the file exceeds configured bounds, or OCR does not produce reviewable text, the row stays scanned_pdf_no_text.

  • If a PDF has no embedded text and accepted OCR does not clear the quality bar, the source row stays blocked before review instead of pretending evidence was extracted.
  • Accepted OCR text enters the same needs_review evidence queue as other extracted text; it neither bypasses named human review nor displays confidence scores for show.
  • Replace the file with a text-bearing PDF, DOCX, XLSX, CSV, or plain text source when possible.
  • If the scanned file must remain the source of truth, keep the blocker visible until OCR succeeds or a manual transcription replacement closes it.
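
The decision described above can be sketched as a small function. This is a simplified model under stated assumptions: the function name, its parameters, and the MIN_REVIEWABLE_CHARS threshold are all hypothetical, the real OCR quality bar is not documented here, and other failure causes (which would produce failed) are ignored.

```python
from typing import Optional

# Hypothetical stand-in for the OCR quality bar; the real rule is not documented here.
MIN_REVIEWABLE_CHARS = 40


def pdf_extraction_outcome(
    has_embedded_text: bool,
    ocr_enabled: bool,
    within_bounds: bool,
    ocr_text: Optional[str],
) -> str:
    """Return the extraction status a PDF would end up in, per this article."""
    if has_embedded_text:
        # Text-bearing PDFs are read directly; OCR is not needed.
        return "succeeded"
    if not ocr_enabled or not within_bounds:
        # OCR is disabled or the file exceeds configured bounds.
        return "scanned_pdf_no_text"
    text = (ocr_text or "").strip()
    if len(text) < MIN_REVIEWABLE_CHARS:
        # OCR ran but did not produce reviewable text.
        return "scanned_pdf_no_text"
    # Accepted OCR text then joins the same needs_review queue as other text.
    return "succeeded"
```

The point of the sketch is that every path without reviewable text ends in scanned_pdf_no_text rather than in an empty succeeded result.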

Transcribe unreadable scans

Manual transcription is the fallback when the scan is still the source of truth but OCR cannot create reviewable text. Treat the transcript as replacement source evidence, not as silent repair of the original scan.

  • Create a text-bearing transcript that names the original file, page range, transcriber, transcription date, and any uncertain words or omissions.
  • Upload the transcript as a replacement source or text-bearing companion, then process extraction so it becomes needs_review evidence.
  • Keep the original scanned row visible for lineage until the replacement transcript is reviewed and approved.
  • Reviewer notes should state that the evidence item relies on a manual transcript of the scanned source.
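
One way to make sure a transcript carries the details listed above is to generate its header from a fixed set of fields. The sketch below is illustrative only: the TranscriptHeader class, the render_transcript function, and the header wording are assumptions, not a format that DefenceFile requires.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class TranscriptHeader:
    """Details a manual transcript should name, per this article."""
    original_file: str
    page_range: str
    transcriber: str
    transcription_date: date
    uncertain_or_omitted: List[str] = field(default_factory=list)


def render_transcript(header: TranscriptHeader, body: str) -> str:
    """Build a plain-text transcript with the lineage header on top."""
    notes = "; ".join(header.uncertain_or_omitted) or "none recorded"
    lines = [
        f"Manual transcript of: {header.original_file}",
        f"Pages: {header.page_range}",
        f"Transcriber: {header.transcriber}",
        f"Transcription date: {header.transcription_date.isoformat()}",
        f"Uncertain words or omissions: {notes}",
        "",
        body.strip(),
    ]
    return "\n".join(lines) + "\n"
```

The resulting text can be saved as a plain text file and uploaded as the replacement source, so the header travels with the evidence into review.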

Resolve extraction failures

Unable to extract on a source row means the extraction worker ran but could not produce reviewable text, so no evidence was created from the file. The row stays blocked before review until the failure is resolved.

  • Re-run extraction if the failure may have been transient (worker timeout, storage hiccup).
  • Replace the source file with a supported format (text-bearing PDF, DOCX, XLSX, CSV, or plain text) if the original file is the cause.
  • Nothing is implied about document contents when extraction fails: no fabricated evidence is created, and the source register shows the honest Unable to extract state.
  • Only once extraction succeeds and the resulting evidence reaches review_ready can the source contribute to a board pack.
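
The choice between re-running and replacing can be sketched as a small triage helper. Everything here is hypothetical: the function name, the returned action strings, and the extension list (in particular, treating ".txt" as the plain text format) are assumptions made for illustration.

```python
from pathlib import Path

# Assumed file extensions for the supported formats named in this article.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt"}


def suggest_failure_action(filename: str, likely_transient: bool) -> str:
    """Suggest a next step for a row showing Unable to extract."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        # Re-running cannot help if the format itself is unsupported.
        return "replace_source"
    if likely_transient:
        # For example a worker timeout or a storage hiccup.
        return "rerun_extraction"
    # Supported format but a persistent failure: the file is the likely cause.
    return "replace_source"
```

Neither action creates evidence on its own; the row stays blocked until extraction actually succeeds.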

Boundary

DefenceFile help explains workflow operation. It does not provide legal advice, create privilege, certify scope, certify reasonable procedures, or guarantee that a statutory defence will succeed.
