Toolkit / Human → Machine

Turn documents into clean, machine-readable markdown.

A rule-based conversion pipeline for URLs, PDFs, and pasted text. Instant. Deterministic. Free. No AI, no tokens, no data retention.

Local-first · No tokens · No tracking

Paste a link to any blog post, article, or documentation page.

Best for documents of up to about 100 pages. Very long books may hit processing limits; split them into chapters before converting.

How it works

The document is classified first. Then the right rules are applied.

01

Extractable text PDF

Best case. Extract text directly, remove headers/footers/page numbers, reflow paragraphs, normalize structure, repair OCR seams.

02

Image-only PDF

Inspect visually. OCR page-by-page. Clean text aggressively. Preserve page sequence. Use visible headings and contents for section structure.

03

Mixed-content visual PDF

Books with ad plates, image-heavy pages mixed with prose, quotation collections. Separate text pages from image pages. Transcribe fully; never force broken OCR through dense plates.
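The classify-first step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual classifier: the function name `classify_document`, the per-page character counts as input, and all three thresholds are assumptions chosen for the example.

```python
from typing import List


def classify_document(
    chars_per_page: List[int],
    min_chars: int = 200,        # hypothetical: below this, a page counts as image-only
    text_share_hi: float = 0.9,  # hypothetical: share of text pages for "extractable"
    text_share_lo: float = 0.1,  # hypothetical: share of text pages for "image-only"
) -> str:
    """Pick one of the three document classes from per-page text-layer sizes."""
    if not chars_per_page:
        raise ValueError("document has no pages")
    # A page with a usable text layer yields many characters on direct extraction.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    share = text_pages / len(chars_per_page)
    if share >= text_share_hi:
        return "extractable_text"
    if share <= text_share_lo:
        return "image_only"
    return "mixed_visual"


if __name__ == "__main__":
    print(classify_document([1500, 1800, 1200]))
    print(classify_document([0, 3, 0]))
    print(classify_document([1500, 0, 0, 1400]))
```

Counting characters in the text layer is only one plausible signal; a real classifier would likely also look at image coverage per page.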

Operating spec

Five non-negotiable constraints.

01

Context over convenience

Never summarize when conversion is what was asked. Books, essays, and quotations stay word-for-word close to the source. Normalize where OCR broke the text, not where the prose is intact.

02

Reading flow first

No broken mid-sentence page transitions. No "Cha Pter" split words. No running headers in body text. The output should read continuously, as a human would read it aloud.
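A rule for repairing page transitions might look like the sketch below. It assumes each page arrives as one string with headers and footers already removed; `join_pages` and `ends_sentence` are hypothetical names, and the heuristics are deliberately simple.

```python
from typing import List


def ends_sentence(text: str) -> bool:
    """True if the text ends with terminal punctuation, ignoring closing quotes or brackets."""
    return text.rstrip("\"')]").endswith((".", "!", "?", ":"))


def join_pages(pages: List[str]) -> str:
    """Join page texts so sentences and split words run on across page breaks."""
    out = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue
        if not out:
            out = text
        elif out.endswith("-") and text[0].islower():
            # Word hyphenated across the break: drop the hyphen and rejoin.
            # Caveat: this also merges genuine compounds such as "well-" / "known".
            out = out[:-1] + text
        elif ends_sentence(out):
            # Page ended on a complete sentence: treat the break as a paragraph break.
            out += "\n\n" + text
        else:
            # Sentence continues on the next page.
            out += " " + text
    return out


if __name__ == "__main__":
    print(join_pages(["The chap-", "ter begins here and", "continues. Done."]))
```

Treating every page that ends on a full stop as a paragraph break is an assumption; a stricter version would compare indentation or line spacing at the top of the next page.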

03

Machine-parseable hierarchy

Explicit markdown structure. Title, author, edition, contents, chapters, notes — all under predictable headings. Chunkable for RAG without guesswork.
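To show why predictable headings make the output chunkable, here is a minimal sketch of a heading-based splitter a downstream RAG pipeline could run. The function name `chunk_by_heading` and the chunk shape (`path`, `text`) are assumptions for illustration; the sketch only recognizes `#`-style headings and does not special-case fenced code blocks.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into chunks, each tagged with its full heading path."""
    preamble: Dict[str, object] = {"path": [], "text": ""}
    chunks = [preamble]
    stack = []  # (level, title) pairs for the headings currently open
    current = preamble
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Close any heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            current = {"path": [t for _, t in stack], "text": ""}
            chunks.append(current)
        else:
            current["text"] += line + "\n"
    for chunk in chunks:
        chunk["text"] = str(chunk["text"]).strip()
    # Drop the preamble chunk if nothing preceded the first heading.
    return [c for c in chunks if c["path"] or c["text"]]


if __name__ == "__main__":
    doc = "# Book\n\nBy Author\n\n## Chapter 1\n\nFirst.\n\n## Chapter 2\n\nSecond.\n"
    for chunk in chunk_by_heading(doc):
        print(chunk["path"], "->", chunk["text"])
```

Carrying the heading path with each chunk lets a retriever show where a passage came from without re-reading the whole file.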

04

Minimal editorial invention

Add structure. Repair seams. Move footnotes. But never paraphrase where transcription is possible. Never replace damaged text with invented summaries.

05

Honest uncertainty

If a page is unreadable, say so. Never use lazy placeholders where the page is mostly recoverable.

Quality bar

The output is good enough when another model can ingest it reliably.

Good enough

  • Structure is explicit
  • Prose reads continuously
  • Obvious OCR failures are repaired
  • Pages and chapters are chunkable
  • Source context is preserved
  • Another model could ingest it reliably

Not good enough

  • Reads like OCR soup
  • Headings clean but paragraphs broken
  • File is mostly placeholders
  • Paraphrases instead of transcribing
  • Publication data is corrupted
  • Body is not trustworthy for parsing

Privacy

PDFs are processed entirely in your browser.

PDFs are never uploaded

PDF parsing runs inside your browser via pdfjs-dist. The file's bytes never touch any server, ours or a third party's. You can confirm this in your browser's network tab.

No third parties

No LLMs, no cloud OCR, no external conversion APIs. URL fetches go directly from a stateless server function to the page you pasted, and the fetched content is then discarded.

No tracking

No analytics on document contents, no saved history, no account required. Once a conversion finishes, the tool keeps no record of what you converted.

Inputs

Three ways in. One consistent output.

URL

Paste a link. We fetch, strip chrome (nav, ads, footer), and convert.
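Stripping chrome can be done with plain rules. The sketch below uses only Python's standard `html.parser`; the class name `ChromeStripper`, the set of skipped tags, and the decision to emit only headings and paragraphs are assumptions for illustration. Ad detection is not shown, since it usually depends on site-specific class names.

```python
from html.parser import HTMLParser
from typing import List

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ChromeStripper(HTMLParser):
    """Collect headings and paragraphs, ignoring anything inside page chrome."""

    SKIP = {"nav", "footer", "aside", "script", "style"}  # assumed chrome tags

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0           # > 0 while inside a skipped element
        self.blocks: List[str] = []   # finished markdown blocks
        self.buf: List[str] = []      # text of the block being built
        self.prefix = ""              # "# " style prefix for headings

    def _flush(self) -> None:
        text = " ".join("".join(self.buf).split())  # collapse whitespace
        if text:
            self.blocks.append(self.prefix + text)
        self.buf, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag in HEADINGS or tag == "p"):
            self._flush()
            if tag in HEADINGS:
                self.prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and (tag in HEADINGS or tag == "p"):
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    page = ("<body><nav><a href='/'>Home</a></nav><h1>Title</h1>"
            "<p>Hello <b>world</b>.</p><footer>Copyright</footer></body>")
    print(html_to_markdown(page))
```

Because every decision is a fixed rule over tag names, the same input always produces the same output.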

PDF

Up to 20 MB. Layout-aware extraction in your browser: running headers and footers are detected by cross-page repetition, and paragraphs are recovered from line spacing.
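Both heuristics can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's code: pages are given as lists of text lines, line positions as `(y, text)` pairs with `y` growing down the page, and the names `strip_running_lines`, `reflow`, `min_share`, and `gap_factor` are hypothetical.

```python
import math
import re
from collections import Counter
from statistics import median
from typing import List, Tuple


def _key(line: str) -> str:
    """Normalize a line so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_running_lines(pages: List[List[str]], min_share: float = 0.6) -> List[List[str]]:
    """Drop first/last lines whose normalized form repeats on enough pages."""
    counts: Counter = Counter()
    for lines in pages:
        # Count each candidate once per page, looking only at the page edges.
        counts.update({_key(l) for l in lines[:1] + lines[-1:] if l.strip()})
    threshold = max(2, math.ceil(min_share * len(pages)))
    repeated = {k for k, n in counts.items() if n >= threshold}
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([l for i, l in enumerate(lines)
                        if not (i in (0, last) and _key(l) in repeated)])
    return cleaned


def reflow(lines: List[Tuple[float, str]], gap_factor: float = 1.5) -> List[str]:
    """Group lines into paragraphs; a gap well above the median starts a new one."""
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if typical and gap > gap_factor * typical:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


if __name__ == "__main__":
    pages = [["My Book", "Text on page one.", "1"],
             ["My Book", "Text on page two.", "2"],
             ["My Book", "Text on page three.", "3"]]
    print(strip_running_lines(pages))
    print(reflow([(0, "A first"), (12, "line wraps."), (36, "Second para"), (48, "continues.")]))
```

Replacing digit runs before counting is what lets bare page numbers and "Page N" footers register as repeats even though their text differs on every page.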

Text

Paste raw text, HTML, or OCR output for normalization and structuring.