Turn documents into clean, machine-readable markdown.
A rule-based conversion pipeline for URLs, PDFs, and pasted text. Instant. Deterministic. Free. No AI, no tokens, no data retention.
Best for documents up to around 100 pages. Very long books may hit processing limits — split into chapters for best results.
How it works
The document is classified first. Then the right rules are applied.
01
Extractable text PDF
Best case. Extract text directly, remove headers/footers/page numbers, reflow paragraphs, normalize structure, repair OCR seams.
02
Image-only PDF
Inspect visually. OCR page-by-page. Clean text aggressively. Preserve page sequence. Use visible headings and contents for section structure.
03
Mixed-content visual PDF
Books with ad plates, image-heavy pages mixed with prose, quotation collections. Separate text pages from image pages. Transcribe text pages fully; never force broken OCR through dense plates.
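The classify-first step can be sketched roughly as below. This is an illustration only, not the tool's actual rules: the function name `classify_pdf`, the `MIN_TEXT_CHARS` threshold, and the idea of deciding from per-page text-layer character counts are all assumptions made for the example.

```python
from typing import List

# Illustrative threshold (an assumption, not a value taken from the tool):
# a page with fewer embedded characters than this is treated as having no usable text layer.
MIN_TEXT_CHARS = 200


def classify_pdf(text_chars_per_page: List[int]) -> str:
    """Classify a PDF from the number of text-layer characters found on each page.

    Returns 'extractable-text', 'image-only', or 'mixed-content'.
    """
    if not text_chars_per_page:
        raise ValueError("document has no pages")

    pages_with_text = sum(1 for n in text_chars_per_page if n >= MIN_TEXT_CHARS)

    if pages_with_text == len(text_chars_per_page):
        return "extractable-text"   # every page has a usable text layer
    if pages_with_text == 0:
        return "image-only"         # no page has one, so OCR is needed throughout
    return "mixed-content"          # some pages need OCR, others do not


print(classify_pdf([1200, 0, 900, 5]))
```

A real classifier would likely also look at image coverage and font information; character counts alone are just the simplest signal to show the branching.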
Operating spec
Five non-negotiable constraints.
01
Context over convenience
Never summarize when conversion is what was asked. Books, essays, and quotations stay text-for-text close to source. Normalize where OCR broke, not where prose is intact.
02
Reading flow first
No broken mid-sentence page transitions. No "Cha Pter" split words. No running headers in body text. The output should read continuously, as a human would read it aloud.
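Two of these repairs can be sketched in a few lines. The code below is a hedged illustration under simple assumptions: `KNOWN_WORDS`, `repair_split_words`, and `join_pages` are hypothetical names, and the sentence-ending heuristic is deliberately crude.

```python
from typing import List

# Hypothetical list of structural words that OCR tends to split in two.
KNOWN_WORDS = {"chapter", "contents", "preface", "introduction"}

# Characters that suggest the previous page ended on a complete sentence.
SENTENCE_END = (".", "!", "?", ":", '"')


def repair_split_words(text: str) -> str:
    """Rejoin pairs like 'Cha Pter' when the joined form is a known word."""
    words = text.split(" ")
    out: List[str] = []
    i = 0
    while i < len(words):
        pair = words[i] + words[i + 1] if i + 1 < len(words) else ""
        if words[i] and pair.lower() in KNOWN_WORDS:
            out.append(pair[0] + pair[1:].lower())  # keep the first letter's case
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


def join_pages(pages: List[str]) -> str:
    """Join page texts so a sentence that spans a page break reads continuously."""
    result = ""
    for page in pages:
        page = page.strip()
        if not page:
            continue
        if not result:
            result = page
        elif result.endswith("-") and page[0].islower():
            result = result[:-1] + page   # word hyphenated across the break
        elif result.endswith(SENTENCE_END):
            result += "\n\n" + page       # previous page ended a sentence
        else:
            result += " " + page          # mid-sentence: same paragraph continues
    return result


print(repair_split_words("Cha Pter 3"))
print(join_pages(["He walked toward", "the door."]))
```

The de-hyphenation branch assumes the hyphen was a line-break artifact; genuine hyphenated compounds split at a page break would need a dictionary check.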
03
Machine-parseable hierarchy
Explicit markdown structure. Title, author, edition, contents, chapters, notes — all under predictable headings. Chunkable for RAG without guesswork.
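To show why predictable headings make the output chunkable, here is a minimal sketch of a heading-based splitter a downstream consumer might write. The function name `chunk_by_heading` and the chunk shape are assumptions; it handles `#`-style headings only and does not special-case fenced code blocks.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into chunks, each tagged with the heading path above it."""
    chunks: List[Dict[str, object]] = []
    path: List[str] = []   # current heading trail, e.g. ["Book", "Chapter 1"]
    body: List[str] = []   # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at this level or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Book\n\n## Chapter 1\n\nFirst.\n\n## Chapter 2\n\nSecond."
for chunk in chunk_by_heading(doc):
    print(chunk["path"], "->", chunk["text"])
```

Each chunk carries its full heading path, so a retrieval system can cite "Book > Chapter 2" without re-parsing the file.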
04
Minimal editorial invention
Add structure. Repair seams. Move footnotes. But never paraphrase where transcription is possible. Never replace damaged text with invented summaries.
05
Honest uncertainty
If a page is unreadable, say so. Never use lazy placeholders where the page is mostly recoverable.
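One way to keep placeholders honest is to gate them on a recoverability score. The sketch below is purely illustrative: `RECOVERABLE_SHARE`, `word_like_share`, `render_page`, and the placeholder wording are all hypothetical, and the scoring is a crude stand-in for real OCR confidence.

```python
import string

# Illustrative threshold (an assumption): at or above this share of word-like
# tokens, the page counts as mostly recoverable and is kept as text.
RECOVERABLE_SHARE = 0.5


def word_like_share(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for token in tokens:
        core = token.strip(string.punctuation).replace("'", "").replace("-", "")
        if core.isalpha():
            good += 1
    return good / len(tokens)


def render_page(page_number: int, ocr_text: str) -> str:
    """Keep recoverable text; emit an explicit marker only for unreadable pages."""
    if word_like_share(ocr_text) >= RECOVERABLE_SHARE:
        return ocr_text.strip()
    return f"[Page {page_number}: unreadable in source scan]"


print(render_page(7, "The whale surfaced near the ship."))
print(render_page(8, "t#e w~~l 3 ;;; x9x |||"))
```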
Quality bar
The output is good enough when another model can ingest it reliably.
Good enough
- Structure is explicit
- Prose reads continuously
- Obvious OCR failures are repaired
- Pages and chapters are chunkable
- Source context is preserved
- Another model could ingest it reliably
Not good enough
- Reads like OCR soup
- Headings clean but paragraphs broken
- File is mostly placeholders
- Paraphrases instead of transcribing
- Publication data is corrupted
- Body is not trustworthy for parsing
Privacy
PDFs are processed entirely in your browser.
PDFs are never uploaded
PDF parsing runs inside your browser via pdfjs-dist. The file's bytes never touch any server, neither ours nor a third party's. You can confirm this in your browser's network tab.
No third parties
No LLMs, no cloud OCR, no external conversion APIs. URL fetches go directly from a stateless server function to the page you pasted, and the fetched content is discarded once the conversion is done.
No tracking
No analytics on document contents, no saved history, no account required. The tool doesn't know what you converted after the conversion finishes.
Inputs
Three ways in. One consistent output.
URL
Paste a link. We fetch, strip chrome (nav, ads, footer), and convert.
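Stripping chrome can be approximated with a rule-based pass over the HTML. This is a minimal sketch using only Python's standard-library `html.parser`; the `SKIP_TAGS` set and the names `ChromeStripper` and `strip_chrome` are assumptions, and ad detection (which usually needs class or id heuristics) is not covered.

```python
from html.parser import HTMLParser
from typing import List

# Elements whose contents are treated as page chrome (an illustrative choice).
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}


class ChromeStripper(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0          # > 0 while inside a skipped element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_chrome(html: str) -> str:
    """Return the page's non-chrome text, one text node per line."""
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<footer>Copyright notice</footer></body></html>"
)
print(strip_chrome(page))
```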
PDF
Up to 20 MB. Layout-aware extraction in your browser: running headers and footers are detected by cross-page repetition, and paragraphs are recovered from line spacing.
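Both heuristics can be sketched briefly. The code below is an assumption-laden illustration, not the tool's implementation: the function names, the `min_share` and `gap_factor` parameters, and the input shapes (lists of line strings per page, and `(y, text)` pairs per line) are all hypothetical.

```python
import re
from collections import Counter
from typing import List, Set, Tuple


def _normalize(line: str) -> str:
    """Mask digits so 'Page 3' and 'Page 4' count as the same line."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages: List[List[str]], min_share: float = 0.6) -> Set[str]:
    """Normalized first/last lines that recur on at least min_share of pages."""
    counts: Counter = Counter()
    for lines in pages:
        if lines:
            counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    needed = max(2, int(min_share * len(pages) + 0.5))
    return {line for line, n in counts.items() if line and n >= needed}


def strip_running_lines(pages: List[List[str]], min_share: float = 0.6) -> List[List[str]]:
    """Drop a page's first or last line when it is a detected running header/footer."""
    running = find_running_lines(pages, min_share)
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            text for i, text in enumerate(lines)
            if not (i in (0, last) and _normalize(text) in running)
        ])
    return cleaned


def group_paragraphs(lines: List[Tuple[float, str]], gap_factor: float = 1.5) -> List[str]:
    """Group (y, text) lines, sorted top to bottom, into paragraphs by vertical gap."""
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = sorted(gaps)[len(gaps) // 2] if gaps else 0.0   # median line spacing
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if typical and gap > gap_factor * typical:
            paragraphs.append([text])      # unusually large gap: new paragraph
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


pages = [
    ["Moby Dick", "Call me Ishmael.", "1"],
    ["Moby Dick", "Some years ago.", "2"],
    ["Moby Dick", "Never mind how long.", "3"],
]
print(strip_running_lines(pages))
```

Masking digits before counting is what lets page numbers be caught by the same repetition test as the running title.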
Text
Paste raw text, HTML, or OCR output for normalization and structuring.