Web-to-Markdown
Scrapes URLs and returns clean Markdown for RAG/LLM contexts. Zero configuration. Sub-second response.
POST /api/extract
Description
Extracts clean, readable Markdown from any web page URL. Strips non-essential HTML elements (navigation, footer, scripts, styles, iframes) and converts the main content to GitHub-flavored Markdown optimized for LLM and RAG contexts. Includes SSRF protection that blocks private IP ranges.
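The description does not say how the SSRF protection is implemented. As a rough illustration of what blocking private IP ranges usually involves, the sketch below classifies a literal IP address with Python's standard `ipaddress` module. `is_blocked_address` is a hypothetical helper, not part of this API, and a real check would also resolve hostnames before testing them.

```python
import ipaddress


def is_blocked_address(host: str) -> bool:
    """Return True if host is a literal IP in a private, loopback, or link-local range.

    Hypothetical helper for illustration; the service's actual check is not documented.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP (for example a hostname). A real check would
        # resolve it via DNS and test every resolved address.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

Under this sketch, addresses such as `10.0.0.5`, `127.0.0.1`, and `169.254.169.254` are rejected, while a public address such as `8.8.8.8` passes.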
Key Features
- Extracts clean, readable Markdown from any web page
- Strips non-essential HTML tags to optimize for LLM token usage
- Preserves document structure with headings, lists, and links
- SSRF protection blocking private IP ranges
- 5MB response size limit and 15-second timeout
- Strips nav, footer, script, style, noscript, iframe, svg, header, and aside tags
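To make the tag-stripping feature concrete, here is a minimal sketch of the filtering step using Python's standard `html.parser`. It only shows how content inside the listed tags can be dropped; it returns plain text rather than Markdown, and `MainTextExtractor` and `visible_text` are hypothetical names, not the service's implementation.

```python
from html.parser import HTMLParser

# Tag names taken from the feature list above.
STRIPPED_TAGS = {
    "nav", "footer", "script", "style", "noscript",
    "iframe", "svg", "header", "aside",
}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any stripped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a stripped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A depth counter is used instead of a boolean so that nested stripped elements (an `aside` inside a `header`, say) are handled correctly.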
Code Examples
```shell
curl -X POST https://api.atomicapis.dev/api/extract \
  -H "X-RapidAPI-Proxy-Secret: YOUR_SECRET" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article"
  }'
```
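```javascript
const response = await fetch('https://api.atomicapis.dev/api/extract', {
  method: 'POST',
  headers: {
    'X-RapidAPI-Proxy-Secret': 'YOUR_SECRET',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/article'
  })
});

// fetch does not reject on HTTP error statuses, so check explicitly
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const data = await response.json();
console.log(data);
```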
```python
import requests

response = requests.post(
    'https://api.atomicapis.dev/api/extract',
    headers={
        'X-RapidAPI-Proxy-Secret': 'YOUR_SECRET',
        'Content-Type': 'application/json'
    },
    json={
        'url': 'https://example.com/article'
    }
)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

data = response.json()
print(data)
```
```csharp
using System.Net.Http.Json;

var client = new HttpClient();
client.DefaultRequestHeaders.Add("X-RapidAPI-Proxy-Secret", "YOUR_SECRET");

var request = new
{
    url = "https://example.com/article"
};

var response = await client.PostAsJsonAsync(
    "https://api.atomicapis.dev/api/extract",
    request
);
response.EnsureSuccessStatusCode(); // throws on a non-success status code

var data = await response.Content.ReadFromJsonAsync<object>();
Console.WriteLine(data);
```
Example response:

```json
{
  "markdown": "# Understanding Modern Web Development\n\nWeb development has evolved significantly over the past decade..."
}
```
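For callers who prefer to avoid third-party packages, the following is a standard-library sketch of the same request with an explicit client-side timeout. The 20-second default is an assumption chosen to sit slightly above the documented 15-second server timeout; `build_request`, `extract_markdown`, and the injectable `opener` parameter are hypothetical names added for illustration and testability.

```python
import json
import urllib.request

API_URL = "https://api.atomicapis.dev/api/extract"


def build_request(page_url: str, secret: str) -> urllib.request.Request:
    """Build the POST request with the JSON body and headers shown above."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-RapidAPI-Proxy-Secret": secret,
            "Content-Type": "application/json",
        },
    )


def extract_markdown(page_url, secret, timeout=20.0,
                     opener=urllib.request.urlopen) -> str:
    """Call the endpoint and return the 'markdown' field of the response.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses and
    a timeout error if no response arrives within `timeout` seconds.
    """
    request = build_request(page_url, secret)
    with opener(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["markdown"]
```

Passing a different `opener` makes it possible to exercise the function without network access, for example in unit tests.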
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The URL to extract content from. Must be an absolute HTTP or HTTPS URL. |
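Since `url` must be an absolute HTTP or HTTPS URL, it can be worth validating input before sending a request. A minimal sketch using `urllib.parse` follows; `is_absolute_http_url` is a hypothetical helper, and the service may apply stricter rules than this.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(value: str) -> bool:
    """Return True for URLs with an http/https scheme and a host part."""
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit rejects some malformed inputs, such as a broken IPv6 host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Relative paths such as `/article` and scheme-less strings such as `example.com/article` fail this check because they have no scheme or host.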
Response Schema
```json
{
  "type": "object",
  "properties": {
    "markdown": {
      "type": "string",
      "description": "The extracted page content converted to GitHub-flavored Markdown"
    }
  }
}
```
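A small sketch of how a client might check a response body against this schema before using it. The schema does not mark `markdown` as required, so treating a missing field as an error is an assumption of this example; `parse_extract_response` is a hypothetical name.

```python
import json


def parse_extract_response(raw: str) -> str:
    """Parse a response body and return its 'markdown' string.

    Raises ValueError if the body is not a JSON object with a
    string-valued 'markdown' field.
    """
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    markdown = payload.get("markdown")
    if not isinstance(markdown, str):
        raise ValueError("response has no string 'markdown' field")
    return markdown
```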
Use Cases
RAG Pipeline Ingestion
Feed web content directly into your Retrieval-Augmented Generation pipelines with clean, token-optimized Markdown.
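One common ingestion step is to split the returned Markdown into chunks at heading boundaries before embedding. The sketch below is one possible approach, not something this API does for you; `split_by_heading` is a hypothetical helper, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

# ATX headings: one to six '#' characters followed by a space, at line start.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, each starting at a heading.

    Text before the first heading becomes its own chunk.
    """
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```

Splitting on headings keeps each chunk topically coherent, which tends to suit retrieval better than fixed-size windows, though very long sections may still need a secondary split.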
LLM Context Preparation
Prepare web articles and documentation as structured context for large language model prompts.
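As an illustration, extracted Markdown can be placed into a prompt with a simple size cap. The sketch below uses a character budget as a crude stand-in for a token budget; `build_prompt`, the 8000-character default, and the prompt wording are all assumptions for the example.

```python
def build_prompt(question: str, markdown: str, max_chars: int = 8000) -> str:
    """Assemble a prompt from extracted Markdown, truncating long pages."""
    context = markdown[:max_chars]
    if len(markdown) > max_chars:
        # Make the truncation visible to the model.
        context += "\n\n[content truncated]"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

For tighter control, the character cap could be replaced with a count from the tokenizer of whichever model is being called.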
Documentation Migration
Migrate existing web-based documentation to Markdown-based systems like Docusaurus or MkDocs.
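A migration script might save each extracted page as a `.md` file named after its URL path. The sketch below assumes a flat output directory (for example a `docs` folder) and hypothetical helpers `slug_for` and `save_page`; front matter and navigation config for the target system are out of scope.

```python
import re
from pathlib import Path
from urllib.parse import urlsplit


def slug_for(url: str) -> str:
    """Derive a filesystem-safe slug from the URL path."""
    path = urlsplit(url).path.strip("/")
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", path).strip("-").lower()
    return slug or "index"  # root URL maps to index


def save_page(url: str, markdown: str, docs_dir: Path) -> Path:
    """Write the Markdown to <docs_dir>/<slug>.md and return the path."""
    docs_dir.mkdir(parents=True, exist_ok=True)
    target = docs_dir / f"{slug_for(url)}.md"
    target.write_text(markdown, encoding="utf-8")
    return target
```

Flattening the path into one slug avoids nested directories; a real migration may prefer to mirror the original URL hierarchy instead.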
Build Constraints
Implementation Notes
This API uses specific techniques to optimize content extraction for LLM consumption:
- Strips non-essential HTML tags to save tokens
- Uses ReverseMarkdown library for HTML-to-Markdown conversion
- Sets TrimmerRootAssembly, a .NET build setting that keeps a rooted assembly and its dependencies when the published app is trimmed
MCP Integration (MCP Ready)
What is MCP?
Model Context Protocol (MCP) allows AI assistants like Claude to call this API as a native tool during conversation. Instead of writing HTTP requests, the AI invokes the tool directly — no API keys or boilerplate needed on the client side.
Tool Details
WebToMarkdownTools
ExtractMarkdown()
Description
Scrapes a URL and converts page content to clean Markdown for RAG/LLM contexts
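For readers curious what a tool invocation looks like on the wire, MCP clients call tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message. The tool name `ExtractMarkdown` is taken from the listing above, but the name actually registered with the MCP server may differ, and the `url` argument name is an assumption based on the HTTP parameter.

```python
import json


def build_tool_call(request_id: int, page_url: str) -> str:
    """Serialize an MCP tools/call request for the extraction tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            # Assumed tool name; check the server's tools/list output.
            "name": "ExtractMarkdown",
            # Assumed argument name, mirroring the HTTP 'url' parameter.
            "arguments": {"url": page_url},
        },
    }
    return json.dumps(message)
```

In practice an MCP client library constructs this message for you; the sketch is only meant to show the shape of the call.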