daisi-broski-skim

URL in. Clean article out.

A production-grade content extraction API that returns Markdown, JSON, or reader-mode HTML. Feed it any public URL and get back the article body, metadata, inline links, hero image, and tables, with no browser or headless Chrome required. Works on HTML pages, Word documents, Excel workbooks, and PDFs.

Windows PowerShell

PS C:\> npm install -g @daisinet/broski

added 2 packages in 2.1s

PS C:\> broski https://en.wikipedia.org/wiki/JavaScript --format md

# JavaScript - Wikipedia

on _en.wikipedia.org_ • [source]

JavaScript (/ˈdʒɑːvəskrɪpt/), often abbreviated as JS, is a
programming language and core technology of the Web...

## History

### Creation at Netscape
...

What you get

Every skim returns the same structured payload, ready to render or index.

📝

Article body

Readability-style scoring identifies the main content root and strips the navigation, ads, and comment rails.
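As a rough illustration of what Readability-style scoring means, the sketch below awards points to each paragraph, credits them to the enclosing containers, and then discounts containers that are mostly link text or whose id/class looks like navigation or comments. It is not the service's actual implementation: the weights, the `NEGATIVE` pattern, and the names `TreeBuilder` and `find_content_root` are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical heuristics: id/class hints that usually mark non-article chrome.
NEGATIVE = re.compile(r"nav|menu|sidebar|footer|comment|advert|promo", re.I)
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
SKIP_TEXT = {"script", "style"}


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.hint = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
        self.parent = parent
        self.text_len = 0      # characters of text under this node
        self.link_len = 0      # characters of that text inside <a>
        self.commas = 0
        self.content_score = 0.0


def ancestors(node):
    """Yield the node itself, then each parent up to the root."""
    while node is not None:
        yield node
        node = node.parent


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = Node("#root", {}, None)
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never contain text
        self.current = Node(tag, dict(attrs), self.current)
        self.nodes.append(self.current)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray closers.
        for node in ancestors(self.current):
            if node.tag == tag and node.parent is not None:
                self.current = node.parent
                return

    def handle_data(self, data):
        text = data.strip()
        chain = list(ancestors(self.current))
        if not text or any(n.tag in SKIP_TEXT for n in chain):
            return
        in_link = any(n.tag == "a" for n in chain)
        for n in chain:
            n.text_len += len(text)
            n.commas += text.count(",")
            if in_link:
                n.link_len += len(text)


def find_content_root(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for p in (n for n in builder.nodes if n.tag == "p"):
        # One base point, one per comma, one per 100 characters (capped at 3).
        points = 1 + p.commas + min(p.text_len // 100, 3)
        p.parent.content_score += points
        if p.parent.parent is not None:
            p.parent.parent.content_score += points / 2

    def final_score(n):
        if n.text_len == 0:
            return 0.0
        score = n.content_score * (1 - n.link_len / n.text_len)
        return score * 0.2 if NEGATIVE.search(n.hint) else score

    return max(builder.nodes, key=final_score, default=None)


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div class="article-body">
<p>First paragraph, with commas, and enough text to matter for scoring.</p>
<p>Second paragraph, also with some commas, continuing the story.</p>
</div>
<div class="comments"><p>Nice post!</p></div>
</body></html>
"""

root = find_content_root(SAMPLE_HTML)
print(root.tag, root.hint.strip())
```

On the sample page the `article-body` container wins: the navigation block has no paragraphs and is all link text, and the comment rail is penalized by its class name.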

🏷️

Metadata

Title, byline, published date, language, site name, and description — merged from Open Graph, Twitter cards, JSON-LD, and semantic HTML.
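One plausible way to merge several metadata sources is a first-non-empty-wins pass over the sources in priority order. The sketch below assumes that approach; the field names, the `merge_metadata` helper, and the precedence (Open Graph, then Twitter cards, then JSON-LD, then semantic HTML) are illustrative assumptions, not documented behavior of the API.

```python
# Hypothetical field names; the real payload may use different keys.
FIELDS = ("title", "byline", "published", "language", "site_name", "description")


def merge_metadata(*sources):
    """Merge dicts given in priority order; the first non-empty value wins."""
    merged = {}
    for field in FIELDS:
        for source in sources:
            value = (source.get(field) or "").strip()
            if value:
                merged[field] = value
                break
    return merged


# Assumed precedence: Open Graph, Twitter cards, JSON-LD, semantic HTML.
open_graph = {"title": "JavaScript", "description": ""}
twitter = {"title": "JS on Twitter", "description": "A programming language"}
json_ld = {"byline": "Jane Doe"}
semantic_html = {"language": "en", "title": "JavaScript - Wikipedia"}

print(merge_metadata(open_graph, twitter, json_ld, semantic_html))
```

Blank values are skipped, so an empty Open Graph description falls through to the Twitter card instead of masking it.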

🔗

Inline links

Hyperlinks are collected from both HTML anchors and PDF /Link annotations, and land in article.Links along with their anchor text.

🖼️

Hero image

Open Graph image, Twitter card image, or the first in-body image. JPEG XObjects inside PDFs surface too.

📊

Real tables

HTML tables, docx / xlsx tables, and PDF coordinate-grid tables all render as Markdown pipe tables with merged-header consolidation.
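To make the idea of merged-header consolidation concrete, here is a minimal sketch that collapses stacked header rows into one row and emits a Markdown pipe table. It assumes that a spanning header appears as a label followed by empty cells, which is only one possible interpretation; `consolidate_headers` and `to_pipe_table` are hypothetical names, not part of the API.

```python
def consolidate_headers(header_rows):
    """Collapse stacked header rows into a single row of column labels."""
    width = max(len(row) for row in header_rows)
    columns = [[] for _ in range(width)]
    for i, row in enumerate(header_rows):
        is_last = i == len(header_rows) - 1
        carried = ""
        for col in range(width):
            cell = str(row[col]).strip() if col < len(row) else ""
            if cell:
                carried = cell
            elif not is_last:
                cell = carried  # assumption: an empty cell continues a span
            if cell and cell not in columns[col]:
                columns[col].append(cell)
    return [" ".join(parts) for parts in columns]


def escape(cell):
    # Pipes and newlines would break the row structure of a pipe table.
    return str(cell).replace("|", "\\|").replace("\n", " ").strip()


def to_pipe_table(header_rows, body_rows):
    headers = consolidate_headers(header_rows)

    def line(cells):
        cells = list(cells) + [""] * (len(headers) - len(cells))
        return "| " + " | ".join(escape(c) for c in cells[:len(headers)]) + " |"

    rows = [line(headers), line(["---"] * len(headers))]
    rows += [line(r) for r in body_rows]
    return "\n".join(rows)


table = to_pipe_table(
    [["Region", "Sales", "", ""], ["", "Q1", "Q2", "Q3"]],
    [["EMEA", 10, 12, 9]],
)
print(table)
# | Region | Sales Q1 | Sales Q2 | Sales Q3 |
# | --- | --- | --- | --- |
# | EMEA | 10 | 12 | 9 |
```

Flattening the group label into each column keeps the output valid for Markdown renderers, which have no notion of column spans.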

📄

Docs & PDFs

The same API unpacks Word documents via OOXML, reads Excel workbooks as per-sheet tables, and parses PDFs, including encrypted files and those using CID fonts, with a from-scratch, BCL-only parser.

Ready to ship?

Create a key, paste the curl example, get JSON back. Takes a minute.
