daisi-broski-skim

URL in. Clean article out.

A production-grade content extraction API that returns Markdown, JSON, or reader-mode HTML. Feed it any public URL and get back the article body, metadata, inline links, hero image, and tables, with no browser or headless Chrome required. Works on HTML pages, Word documents, Excel workbooks, and PDFs.

What you get

Every skim returns the same structured payload, ready to render or index.

📝

Article body

Readability-style scoring identifies the main content root and strips the navigation, ads, and comment rails.
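
To give a feel for what Readability-style scoring means, here is a minimal sketch, not the service's actual implementation. It assumes candidate blocks have already been pulled out of the page; the `Block` type, the `score` weights, and `pick_main` are all hypothetical names chosen for illustration. The idea is that long, comma-rich text scores high, while blocks made mostly of link text (navigation, link rails) are pushed toward zero.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate content container (hypothetical shape, for illustration)."""
    tag: str
    text: str        # all visible text inside the block
    link_text: str   # the portion of that text that sits inside <a> elements


def score(block: Block) -> float:
    """Reward prose-like text and penalize link-heavy blocks."""
    text_len = len(block.text)
    if text_len == 0:
        return 0.0
    # Fraction of the block's text that is link text, clamped to [0, 1].
    link_density = min(1.0, len(block.link_text) / text_len)
    # One base point, one per comma, and up to three for sheer length.
    base = 1 + block.text.count(",") + min(text_len // 100, 3)
    return base * (1.0 - link_density)


def pick_main(blocks: list[Block]) -> Block:
    """Return the highest-scoring block as the main content root."""
    return max(blocks, key=score)
```

With these assumed weights, a navigation bar whose text is entirely links scores 0, so any block with real prose wins.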

🏷️

Metadata

Title, byline, published date, language, site name, and description — merged from Open Graph, Twitter cards, JSON-LD, and semantic HTML.
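
As a rough illustration of merging metadata from several sources, the sketch below takes the first non-empty value for each field, walking the sources in a fixed order. The precedence shown (Open Graph, then Twitter cards, then JSON-LD, then semantic HTML) is an assumption that simply follows the order they are listed above; the source names and the `merge_metadata` function are hypothetical, not part of the documented API.

```python
# Assumed precedence; the service's real ordering is not documented here.
PRIORITY = ("open_graph", "twitter", "json_ld", "html")


def merge_metadata(sources: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge per-source metadata, keeping the first non-empty value per field."""
    merged: dict[str, str] = {}
    for name in PRIORITY:
        for field, value in sources.get(name, {}).items():
            value = value.strip()
            # Skip blanks so a lower-priority source can still fill the field.
            if value and field not in merged:
                merged[field] = value
    return merged


if __name__ == "__main__":
    print(merge_metadata({
        "open_graph": {"title": "OG Title", "description": ""},
        "json_ld": {"title": "LD Title", "byline": "A. Writer"},
        "html": {"description": "From meta tag", "language": "en"},
    }))
```

Here the empty Open Graph description is skipped, so the description falls through to the semantic HTML value while the title stays with Open Graph.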

🔗

Inline links

Hyperlinks carry through from both HTML anchors and PDF /Link annotations, and land in article.Links with their anchor text.

🖼️

Hero image

Open Graph image, Twitter card image, or the first in-body image. JPEG XObjects inside PDFs surface too.

📊

Real tables

HTML tables, docx / xlsx tables, and PDF coordinate-grid tables all render as Markdown pipe tables with merged-header consolidation.
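
For readers unfamiliar with Markdown pipe tables, the sketch below shows one way a header row and data rows could be rendered into that format. It is an illustrative assumption, not the service's code: `to_pipe_table` is a hypothetical helper, and it does not attempt the merged-header consolidation mentioned above. It escapes literal pipes and pads short rows so every line has the same number of cells.

```python
def to_pipe_table(header: list, rows: list) -> str:
    """Render a header and rows as a Markdown pipe table."""

    def cell(value) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the row.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    def line(cells) -> str:
        return "| " + " | ".join(cells) + " |"

    width = len(header)
    lines = [
        line(cell(h) for h in header),
        line("---" for _ in header),
    ]
    for row in rows:
        # Truncate long rows and pad short ones to the header width.
        padded = list(row)[:width] + [""] * (width - len(row))
        lines.append(line(cell(c) for c in padded))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_pipe_table(["Name", "Qty"], [["Widget", 2], ["A|B"]]))
```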

📄

Docs & PDFs

The same API unpacks Word documents via OOXML, reads Excel workbooks as per-sheet tables, and parses PDFs, including encrypted files and CID fonts, with a from-scratch, BCL-only parser.

Ready to ship?

Create a key, paste the curl example, get JSON back. Takes a minute.