Web-to-Markdown

Scrapes URLs and returns clean Markdown for RAG/LLM contexts. Zero configuration. Sub-second response.

POST /api/extract
Data Processing
~50ms avg latency
API Key auth
99.9% uptime

Description

Extracts clean, readable Markdown from any web page URL. Strips non-essential HTML elements (navigation, footer, scripts, styles, iframes) and converts the main content to GitHub-flavored Markdown optimized for LLM and RAG contexts. Includes SSRF protection that blocks private IP ranges.
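The SSRF protection described above can be pictured with a minimal Python sketch. This is not the service's actual implementation: the function names `is_public_address` and `resolves_to_public_host` are hypothetical, and the exact set of blocked ranges is an assumption that goes slightly beyond "private IP ranges" (it also rejects loopback and link-local addresses).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip_text: str) -> bool:
    """Return False for private, loopback, link-local and other special ranges."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolves_to_public_host(url: str) -> bool:
    """Resolve the URL's host and require every resolved address to pass the check."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    # info[4][0] is the textual IP address of each resolved socket address.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)
```

Checking every resolved address, rather than only the first, is a deliberate choice in this sketch: a hostname that resolves to both a public and a private address is rejected.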

Key Features

  • Extracts clean, readable Markdown from any web page
  • Strips nav, footer, script, style, noscript, iframe, svg, header, and aside tags to reduce LLM token usage
  • Preserves document structure with headings, lists, and links
  • SSRF protection blocking private IP ranges
  • 5 MB response size limit and 15-second timeout
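To illustrate the tag stripping listed above, here is a rough sketch built on Python's standard `html.parser`. It is an assumption-laden illustration, not the service's code: the class `TagStripper` and the helper `strip_tags` are hypothetical names, and the sketch simply drops each listed element together with everything nested inside it. Comments and doctype declarations are also dropped, because the sketch does not re-emit them.

```python
from html.parser import HTMLParser

# Tag names taken from the feature list above.
STRIPPED_TAGS = {
    "nav", "footer", "script", "style", "noscript",
    "iframe", "svg", "header", "aside",
}


class TagStripper(HTMLParser):
    """Re-emit HTML while skipping stripped elements and their contents."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self._skip_depth = 0  # > 0 while inside a stripped element
        self._out: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self._out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in STRIPPED_TAGS and self._skip_depth == 0:
            self._out.append(self.get_starttag_text())

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._out.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self._out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self._out.append(f"&#{name};")

    def result(self) -> str:
        return "".join(self._out)


def strip_tags(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.result()
```

The cleaned HTML would then be handed to an HTML-to-Markdown converter; that step is outside this sketch.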

Code Examples

curl -X POST https://api.atomicapis.dev/api/extract \
  -H "X-RapidAPI-Proxy-Secret: YOUR_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'

Response
{
  "markdown": "# Understanding Modern Web Development\n\nWeb development has evolved significantly over the past decade..."
}
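
For callers who prefer Python over curl, the request above might be expressed roughly as follows, using only the standard library. The endpoint, header names, and the `markdown` response field come from the example above; the function names and the 20-second client timeout (chosen to sit above the documented 15-second fetch timeout) are assumptions of this sketch.

```python
import json
import urllib.request

API_URL = "https://api.atomicapis.dev/api/extract"


def build_extract_request(page_url: str, secret: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-RapidAPI-Proxy-Secret": secret,
            "Content-Type": "application/json",
        },
    )


def extract_markdown(page_url: str, secret: str, timeout: float = 20.0) -> str:
    """Send the request and return the 'markdown' field of the JSON response."""
    request = build_extract_request(page_url, secret)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["markdown"]
```

A call would look like `extract_markdown("https://example.com/article", "YOUR_SECRET")`; HTTP errors surface as `urllib.error.HTTPError` and are left unhandled here.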

Parameters

Name  Type    Required  Description
url   string  Yes       The URL to extract content from. Must be an absolute HTTP or HTTPS URL.
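
A client can check the `url` constraint before sending a request. The sketch below is one possible reading of "absolute HTTP or HTTPS URL" (an http or https scheme plus a host); the function name `is_valid_extract_url` is hypothetical, and the server's own validation may be stricter.

```python
from urllib.parse import urlsplit


def is_valid_extract_url(url: str) -> bool:
    """Accept only absolute URLs with an http or https scheme and a host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises on malformed input such as an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```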

Response Schema

JSON Schema
{
  "type": "object",
  "properties": {
    "markdown": {
      "type": "string",
      "description": "The extracted page content converted to GitHub-flavored Markdown"
    }
  }
}
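
On the client side, the schema above could be mirrored by a small typed wrapper. This is a sketch under one assumption worth stating: the schema does not list `markdown` as required, but the code below treats a missing or non-string value as an error. `ExtractResult` and `parse_extract_response` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractResult:
    markdown: str


def parse_extract_response(raw: str) -> ExtractResult:
    """Parse a JSON response body and check it against the documented shape."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("response body is not a JSON object")
    markdown = payload.get("markdown")
    if not isinstance(markdown, str):
        raise ValueError("response has no string 'markdown' field")
    return ExtractResult(markdown=markdown)
```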

Use Cases

RAG Pipeline Ingestion

Feed web content directly into your Retrieval-Augmented Generation pipelines with clean, token-optimized Markdown.
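
As an illustration of the ingestion step, the returned Markdown can be split into heading-delimited chunks before embedding. This is a simple sketch, not a recommended chunking strategy: `split_by_heading` is a hypothetical helper, and it does not recognise fenced code blocks, so a line starting with "# " inside a code sample would also start a new chunk.

```python
import re

# Matches the start of any ATX heading line (# through ######).
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks that each begin at a heading."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk]
```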

LLM Context Preparation

Prepare web articles and documentation as structured context for large language model prompts.
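
One way to use the extracted Markdown as prompt context is sketched below. Everything here is an assumption for illustration: the helper name `build_prompt`, the prompt wording, the `<page>` delimiters, and the use of a character budget as a crude stand-in for a token budget.

```python
def build_prompt(markdown: str, question: str, max_chars: int = 8000) -> str:
    """Wrap page content and a question into one prompt, trimming long pages."""
    context = markdown
    if len(context) > max_chars:
        # Prefer cutting at the last paragraph break inside the budget.
        cut = context.rfind("\n\n", 0, max_chars)
        context = context[: cut if cut > 0 else max_chars].rstrip()
    return (
        "Use the following page content to answer.\n\n"
        f"<page>\n{context}\n</page>\n\n"
        f"Question: {question}"
    )
```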

Documentation Migration

Migrate existing web-based documentation to Markdown-based systems like Docusaurus or MkDocs.
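
For migration, each extracted page typically needs a small metadata header before it is saved as a `.md` file. The sketch below assumes YAML front matter with a `title` key, which both Docusaurus and MkDocs read; the helper name `to_doc_page` and the choice to take the title from the first level-1 heading are illustrative assumptions.

```python
import json
import re


def to_doc_page(markdown: str, fallback_title: str = "Untitled") -> str:
    """Prefix Markdown with YAML front matter whose title is the first H1."""
    match = re.search(r"^# +(.+)$", markdown, re.MULTILINE)
    title = match.group(1).strip() if match else fallback_title
    # json.dumps yields a double-quoted, escaped string that is also valid YAML.
    quoted = json.dumps(title, ensure_ascii=False)
    return f"---\ntitle: {quoted}\n---\n\n{markdown.strip()}\n"
```

The returned string can be written to disk with `pathlib.Path(...).write_text(page, encoding="utf-8")`.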

Build Constraints

Implementation Notes

This API uses specific techniques to optimize content extraction for LLM consumption:

  • Strips non-essential HTML tags to save tokens
  • Uses ReverseMarkdown library for HTML-to-Markdown conversion
  • Sets TrimmerRootAssembly in the .NET build configuration, which keeps the rooted assembly from being removed when the published app is trimmed

MCP Integration (MCP Ready)

What is MCP?

Model Context Protocol (MCP) lets AI assistants such as Claude call this API as a native tool during a conversation. Instead of writing HTTP requests, the assistant invokes the tool directly, so the client needs no API keys or boilerplate.

Tool Details

Tool Class: WebToMarkdownTools
Method: ExtractMarkdown()

Description

Scrapes a URL and converts page content to clean Markdown for RAG/LLM contexts
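
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` message. The sketch below shows what such a message might look like for this tool. The registered tool name and the argument name are assumptions: the page gives only the method name `ExtractMarkdown()` and the `url` parameter of the HTTP API, and the name a server actually exposes may differ. `build_tool_call` is a hypothetical helper.

```python
import json


def build_tool_call(request_id: int, page_url: str,
                    tool_name: str = "ExtractMarkdown") -> str:
    """Serialize a JSON-RPC 2.0 'tools/call' request for an MCP server."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool_name,  # assumed tool name, see note above
            "arguments": {"url": page_url},  # assumed argument name
        },
    }
    return json.dumps(message)
```

In practice an MCP client library builds this message; the sketch only makes the wire format visible.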