LIVE PRODUCT
Crawl API
Crawl a bounded site section and monitor progress through one queue-backed job.
Crawl API handles same-origin discovery, robots-aware crawling, page caps, and progress tracking through one async contract. It is built for teams that need bounded site walks without owning the queueing and crawl-control layer themselves.
Endpoint
POST /v1/crawl
Poll GET /v1/crawl/:jobId and cancel queued jobs with DELETE /v1/crawl/:jobId.
Credits
1 credit per completed page
Credits are reserved up front based on max_pages and finalized as the crawl completes.
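As a worked example of that lifecycle (a sketch only; `settle_credits` is a hypothetical helper, not part of the API): a crawl submitted with `max_pages` 25 reserves 25 credits, and a run that completes 16 pages is charged 16, releasing the other 9.

```python
# Hypothetical helper illustrating the credit lifecycle: credits are
# reserved from max_pages and the final charge matches completed pages.
def settle_credits(max_pages, pages_completed):
    reserved = max_pages                      # 1 credit reserved per budgeted page
    charged = min(pages_completed, reserved)  # 1 credit per completed page
    return {
        "credits_reserved": reserved,
        "credits_charged": charged,
        "credits_released": reserved - charged,
    }

print(settle_credits(25, 16))
# {'credits_reserved': 25, 'credits_charged': 16, 'credits_released': 9}
```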
Output
Progress JSON with page status list
Read crawl progress, job status, and per-page status from the status endpoint.
What it's for
- knowledge-base ingestion from a bounded docs or blog section
- site discovery with include and exclude controls
- content monitoring for site sections that change over time
- crawl progress tracking in internal admin tools
- page-by-page crawl pipelines that need clear billing and cancellation rules
How it works
1. Submit a domain plus crawl limits, patterns, and optional webhook settings.
2. OrbitScraper runs a bounded crawl, tracks discovered pages, and stores per-page status as the job progresses.
3. Poll for progress, read the final page list, or cancel the crawl while it is still queued.
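The three steps above can be sketched in Python with the standard library. The endpoint paths and terminal statuses come from this page; the polling interval, helper names, and API key are placeholders.

```python
import json
import time
import urllib.request

API = "https://api.orbitscraper.com/v1"
HEADERS = {"x-api-key": "ORS_live_1234567890", "Content-Type": "application/json"}
# Terminal job statuses documented in the response fields.
TERMINAL = {"completed", "failed", "cancelled", "expired"}

def _call(method, path, body=None):
    """Send one authenticated request and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(API + path, data=data, headers=HEADERS, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_crawl(domain, **limits):
    # Step 1: submit the domain plus crawl limits and patterns.
    return _call("POST", "/crawl", {"domain": domain, **limits})["job_id"]

def wait_for_crawl(job_id, interval=5.0):
    # Step 3: poll until the job reaches a terminal status.
    while True:
        job = _call("GET", "/crawl/" + job_id)
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)

def cancel_crawl(job_id):
    # Cancellation only applies while the job is still queued.
    _call("DELETE", "/crawl/" + job_id)
```

A production client would also back off on rate-limit responses and read `credits_charged` from the final payload.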
Request parameters
These are the fields accepted by the current backend contract for POST /v1/crawl.
| Name | Type | Required | Description |
|---|---|---|---|
| domain | string | Yes | Starting domain or URL. The https:// scheme is assumed when none is provided. |
| max_pages | integer | No | Maximum pages to crawl. Defaults to 50. Range 1-500. |
| depth | integer | No | Link depth from the seed URL. Defaults to 3. Range 1-5. |
| include_patterns | string[] | No | Optional glob-style path allowlist. |
| exclude_patterns | string[] | No | Optional glob-style path denylist. |
| render_js | boolean | No | Use browser-backed rendering for page fetches when true. Defaults to false. |
| use_proxy | `auto` \| `always` \| `never` | No | Proxy strategy for fetches. Defaults to auto. |
| webhook_url | string | No | Optional completion webhook target. |
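The page only specifies "glob-style" patterns, so the exact matching dialect is server-side. As a rough illustration of how the allowlist and denylist interact, Python's `fnmatch` approximates the behaviour (`in_scope` is a hypothetical helper, not part of the API):

```python
from fnmatch import fnmatch

def in_scope(path, include, exclude):
    # An empty allowlist admits everything; otherwise the path must
    # match at least one include pattern and no exclude pattern.
    included = not include or any(fnmatch(path, p) for p in include)
    excluded = any(fnmatch(path, p) for p in exclude)
    return included and not excluded

include = ["/blog/**", "/docs/**"]
exclude = ["/account/**"]

print(in_scope("/docs/quickstart", include, exclude))  # True
print(in_scope("/account/billing", include, exclude))  # False
print(in_scope("/pricing", include, exclude))          # False (not in allowlist)
```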
Response fields
These fields describe the completed payload you read from the current public API contract.
| Name | Type | Description |
|---|---|---|
| job_id | string | Crawl job identifier. |
| request_id | string | Request identifier for tracing the public API call. |
| trace_id | string | Trace identifier attached to the crawl lifecycle. |
| status | string | Current job status such as queued, running, completed, failed, cancelled, or expired. |
| domain | string | Normalized crawl seed URL. |
| max_pages | integer | Configured crawl page budget. |
| depth | integer | Configured crawl depth. |
| pages_found | integer | Pages discovered so far. |
| pages_completed | integer | Pages completed successfully. |
| pages_failed | integer | Pages that failed or were skipped. |
| credits_reserved | integer | Credits reserved from the configured crawl budget. |
| credits_charged | integer | Credits actually charged so far. |
| webhook_url | string \| null | Configured webhook target, if any. |
| error | object \| undefined | Present on failed, cancelled, or expired jobs. |
| pages | array | Per-page status objects with url, status, title, error_code, and fetched_at. |
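A hypothetical consumer of the completed payload might tally the per-page statuses and sanity-check billing against the one-credit-per-completed-page rule. The field names match the table above; `summarize` and the sample job are illustrative.

```python
from collections import Counter

def summarize(job):
    # Count pages by status and confirm the charge matches completions.
    by_status = Counter(p["status"] for p in job["pages"])
    return {
        "status": job["status"],
        "by_status": dict(by_status),
        "billing_consistent": job["credits_charged"] == job["pages_completed"],
    }

job = {
    "status": "completed",
    "pages_completed": 1,
    "credits_charged": 1,
    "pages": [{"url": "https://docs.example.com/getting-started", "status": "completed"}],
}
print(summarize(job))
```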
Code examples
The examples below show the raw HTTP submit, poll, and cancel flow with cURL; the same calls port directly to Python, JavaScript, Java, or PHP.
curl -X POST "https://api.orbitscraper.com/v1/crawl" \
-H "x-api-key: ORS_live_1234567890" \
-H "Content-Type: application/json" \
-d '{
"domain": "https://docs.example.com",
"max_pages": 25,
"depth": 2,
"include_patterns": ["/blog/**", "/docs/**"],
"exclude_patterns": ["/account/**"],
"render_js": false,
"use_proxy": "auto"
}'
curl -X GET "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
-H "x-api-key: ORS_live_1234567890"
curl -X DELETE "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
-H "x-api-key: ORS_live_1234567890"

Response examples
This is the shape you get back from the current public API contract for Crawl API.
Queued response
The first response confirms the job was accepted and tells you what to poll next.
{
"request_id": "req_xyz",
"trace_id": "trace_xyz",
"job_id": "crawl_123456",
"status": "queued",
"credits_reserved": 25
}

Completed response
After polling, this is the final payload shape your app reads.
{
"job_id": "crawl_123456",
"request_id": "req_xyz",
"trace_id": "trace_xyz",
"status": "completed",
"domain": "https://docs.example.com",
"max_pages": 25,
"depth": 2,
"pages_found": 18,
"pages_completed": 16,
"pages_failed": 2,
"credits_reserved": 25,
"credits_charged": 16,
"webhook_url": null,
"pages": [
{
"url": "https://docs.example.com/getting-started",
"status": "completed",
"title": "Getting started",
"error_code": null,
"fetched_at": "2026-03-27T10:40:00.000Z"
}
]
}

Ready to build on Crawl API?
The current backend contract is already live. Use the docs page for request details and the pricing page for credit planning.