Crawl API
Crawl API handles same-origin discovery, robots-aware crawling, page caps, and progress tracking through one async contract. It is built for teams that need bounded site walks without owning the queueing and crawl-control layer themselves.
Endpoint
POST /v1/crawl
Poll GET /v1/crawl/:jobId and cancel queued jobs with DELETE /v1/crawl/:jobId.
Credits
1 credit per completed page
Credits are reserved up front based on max_pages and finalized as the crawl completes.
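As a sketch of that accounting in Python (the field names mirror the response table below; the `credits_released` value is this sketch's own convention for the unused portion of the reservation, not a documented field):

```python
def settle_crawl_credits(max_pages: int, pages_completed: int) -> dict:
    """Model the documented billing: reserve max_pages credits up front,
    charge 1 credit per completed page, and release the remainder."""
    credits_reserved = max_pages       # reserved when the job is accepted
    credits_charged = pages_completed  # 1 credit per completed page
    return {
        "credits_reserved": credits_reserved,
        "credits_charged": credits_charged,
        "credits_released": credits_reserved - credits_charged,
    }
```

For the completed-response example later in this page (a 25-page budget with 16 pages completed), this yields 25 reserved and 16 charged.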
Output
Progress JSON with page status list
Read crawl progress, job status, and per-page status from the status endpoint.
Request parameters
| Name | Type | Required | Description |
|---|---|---|---|
| domain | string | Yes | Starting domain or URL. HTTPS is assumed when the scheme is omitted. |
| max_pages | integer | No | Maximum pages to crawl. Defaults to 50. Range 1-500. |
| depth | integer | No | Link depth from the seed URL. Defaults to 3. Range 1-5. |
| include_patterns | string[] | No | Optional glob-style path allowlist. |
| exclude_patterns | string[] | No | Optional glob-style path denylist. |
| render_js | boolean | No | Use browser-backed rendering for page fetches when true. Defaults to false. |
| use_proxy | auto \| always \| never | No | Proxy strategy for fetches. Defaults to auto. |
| webhook_url | string | No | Optional completion webhook target. |
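The ranges above can be enforced client-side before spending a request. A minimal sketch, assuming you assemble the JSON body yourself (the helper name and error messages are illustrative, not part of the API):

```python
def build_crawl_request(domain: str, max_pages: int = 50, depth: int = 3,
                        render_js: bool = False, use_proxy: str = "auto",
                        **optional) -> dict:
    """Validate documented parameter ranges and assemble a /v1/crawl body."""
    if not 1 <= max_pages <= 500:
        raise ValueError("max_pages must be in 1-500")
    if not 1 <= depth <= 5:
        raise ValueError("depth must be in 1-5")
    if use_proxy not in ("auto", "always", "never"):
        raise ValueError("use_proxy must be auto, always, or never")
    body = {"domain": domain, "max_pages": max_pages, "depth": depth,
            "render_js": render_js, "use_proxy": use_proxy}
    # include_patterns, exclude_patterns, and webhook_url pass through untouched
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Catching a bad range locally is cheaper than letting the API reject the job after credits have been reserved.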
Response fields
| Name | Type | Description |
|---|---|---|
| job_id | string | Crawl job identifier. |
| request_id | string | Request identifier for tracing the public API call. |
| trace_id | string | Trace identifier attached to the crawl lifecycle. |
| status | string | Current job status such as queued, running, completed, failed, cancelled, or expired. |
| domain | string | Normalized crawl seed URL. |
| max_pages | integer | Configured crawl page budget. |
| depth | integer | Configured crawl depth. |
| pages_found | integer | Pages discovered so far. |
| pages_completed | integer | Pages completed successfully. |
| pages_failed | integer | Pages that failed or were skipped. |
| credits_reserved | integer | Credits reserved from the configured crawl budget. |
| credits_charged | integer | Credits actually charged so far. |
| webhook_url | string \| null | Configured webhook target, if any. |
| error | object \| undefined | Present on failed, cancelled, or expired jobs. |
| pages | array | Per-page status objects with url, status, title, error_code, and fetched_at. |
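A small helper that condenses the fields above into a progress summary (the `settled` and `in_flight` derivations are this sketch's own convention, computed from the documented counters, not fields the API returns):

```python
def summarize_crawl(status_payload: dict) -> dict:
    """Condense a GET /v1/crawl/:jobId payload into a progress summary."""
    found = status_payload["pages_found"]
    done = status_payload["pages_completed"]
    failed = status_payload["pages_failed"]
    return {
        "status": status_payload["status"],
        "settled": done + failed,            # pages with a final outcome
        "in_flight": found - done - failed,  # discovered but not yet finished
        "failed_urls": [p["url"] for p in status_payload.get("pages", [])
                        if p["status"] != "completed"],
    }
```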
Code examples
The examples below follow the current Crawl API contract.
Start with the raw HTTP request and poll flow.

```bash
curl -X POST "https://api.orbitscraper.com/v1/crawl" \
  -H "x-api-key: ORS_live_1234567890" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "https://docs.example.com",
    "max_pages": 25,
    "depth": 2,
    "include_patterns": ["/blog/**", "/docs/**"],
    "exclude_patterns": ["/account/**"],
    "render_js": false,
    "use_proxy": "auto"
  }'
```
```bash
curl -X GET "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
  -H "x-api-key: ORS_live_1234567890"
```

```bash
curl -X DELETE "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
  -H "x-api-key: ORS_live_1234567890"
```

Response examples
These are the payload shapes the current public Crawl API contract returns.
Queued response
The first response confirms the job was accepted and tells you what to poll.
```json
{
  "request_id": "req_xyz",
  "trace_id": "trace_xyz",
  "job_id": "crawl_123456",
  "status": "queued",
  "crawl_credits_reserved": 25
}
```

Completed response
After polling, this is the final payload your app reads.
```json
{
  "job_id": "crawl_123456",
  "request_id": "req_xyz",
  "trace_id": "trace_xyz",
  "status": "completed",
  "domain": "https://docs.example.com",
  "max_pages": 25,
  "depth": 2,
  "pages_found": 18,
  "pages_completed": 16,
  "pages_failed": 2,
  "credits_reserved": 25,
  "credits_charged": 16,
  "webhook_url": null,
  "pages": [
    {
      "url": "https://docs.example.com/getting-started",
      "status": "completed",
      "title": "Getting started",
      "error_code": null,
      "fetched_at": "2026-03-27T10:40:00.000Z"
    }
  ]
}
```

Operational notes
- The current public contract uses domain, include_patterns, and exclude_patterns rather than singular url or pattern fields.
- Only queued crawl jobs can be cancelled through DELETE /v1/crawl/:jobId.
- The current deployment bills 1 credit per completed page and reserves credits from the max_pages budget up front.
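Putting the endpoints together, a polling loop might look like the sketch below. The fetch function is injected so the flow can be exercised without network access; the terminal statuses come from the response-fields table above, and the interval and timeout defaults are illustrative.

```python
import time
from typing import Callable

# Statuses after which polling should stop, per the response-fields table.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}

def wait_for_crawl(job_id: str, fetch_status: Callable[[str], dict],
                   interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll GET /v1/crawl/:jobId (via fetch_status) until a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch_status(job_id)
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")
```

In production, `fetch_status` would issue the GET request shown in the curl examples; in tests it can be a stub that replays canned payloads.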