Crawl API
Crawl API handles same-origin discovery, robots-aware crawling, page caps, and progress tracking through one async contract. It is built for teams that need bounded site walks without owning the queueing and crawl-control layer themselves.
Endpoint
POST /v1/crawl
Poll GET /v1/crawl/:jobId and cancel queued jobs with DELETE /v1/crawl/:jobId.
Credits
1 credit per completed page
Credits are reserved up front based on max_pages and finalized as the crawl completes.
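As a sketch of that accounting in Python (the field names mirror the response table below; the `credits_released` value is this sketch's own convention for the unused portion of the reservation, not a documented field):

```python
def settle_crawl_credits(max_pages: int, pages_completed: int) -> dict:
    """Model the documented billing: reserve max_pages credits up front,
    charge 1 credit per completed page, and release the remainder."""
    credits_reserved = max_pages       # reserved when the job is accepted
    credits_charged = pages_completed  # 1 credit per completed page
    return {
        "credits_reserved": credits_reserved,
        "credits_charged": credits_charged,
        "credits_released": credits_reserved - credits_charged,
    }
```

For the completed-response example later in this page (a 25-page budget with 16 pages completed), this yields 25 reserved and 16 charged.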
Output
Progress JSON with page status list
Read crawl progress, job status, and per-page status from the status endpoint.
Request parameters
| Name | Type | Required | Description |
|---|---|---|---|
| domain | string | Yes | Starting domain or URL. HTTPS is assumed when the scheme is omitted. |
| max_pages | integer | No | Maximum pages to crawl. Defaults to 50. Range 1-500. |
| depth | integer | No | Link depth from the seed URL. Defaults to 3. Range 1-5. |
| include_patterns | string[] | No | Optional glob-style path allowlist. |
| exclude_patterns | string[] | No | Optional glob-style path denylist. |
| render_js | boolean | No | Use browser-backed rendering for page fetches when true. Defaults to false. |
| use_proxy | auto \| always \| never | No | Proxy strategy for fetches. Defaults to auto. |
| webhook_url | string | No | Optional completion webhook target. |
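The ranges above can be enforced client-side before spending a request. A minimal sketch, assuming you assemble the JSON body yourself (the helper name and error messages are illustrative, not part of the API):

```python
def build_crawl_request(domain: str, max_pages: int = 50, depth: int = 3,
                        render_js: bool = False, use_proxy: str = "auto",
                        **optional) -> dict:
    """Validate documented parameter ranges and assemble a /v1/crawl body."""
    if not 1 <= max_pages <= 500:
        raise ValueError("max_pages must be in 1-500")
    if not 1 <= depth <= 5:
        raise ValueError("depth must be in 1-5")
    if use_proxy not in ("auto", "always", "never"):
        raise ValueError("use_proxy must be auto, always, or never")
    body = {"domain": domain, "max_pages": max_pages, "depth": depth,
            "render_js": render_js, "use_proxy": use_proxy}
    # include_patterns, exclude_patterns, and webhook_url pass through untouched
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Catching a bad range locally is cheaper than letting the API reject the job after credits have been reserved.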
Response fields
| Name | Type | Description |
|---|---|---|
| job_id | string | Crawl job identifier. |
| request_id | string | Request identifier for tracing the public API call. |
| trace_id | string | Trace identifier attached to the crawl lifecycle. |
| status | string | Current job status such as queued, running, completed, failed, cancelled, or expired. |
| domain | string | Normalized crawl seed URL. |
| max_pages | integer | Configured crawl page budget. |
| depth | integer | Configured crawl depth. |
| pages_found | integer | Pages discovered so far. |
| pages_completed | integer | Pages completed successfully. |
| pages_failed | integer | Pages that failed or were skipped. |
| credits_reserved | integer | Credits reserved from the configured crawl budget. |
| credits_charged | integer | Credits actually charged so far. |
| webhook_url | string \| null | Configured webhook target, if any. |
| error | object \| undefined | Present on failed, cancelled, or expired jobs. |
| pages | array | Per-page status objects with url, status, title, error_code, and fetched_at. |
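A small helper that condenses the fields above into a progress summary (the `settled` and `in_flight` derivations are this sketch's own convention, computed from the documented counters, not fields the API returns):

```python
def summarize_crawl(status_payload: dict) -> dict:
    """Condense a GET /v1/crawl/:jobId payload into a progress summary."""
    found = status_payload["pages_found"]
    done = status_payload["pages_completed"]
    failed = status_payload["pages_failed"]
    return {
        "status": status_payload["status"],
        "settled": done + failed,            # pages with a final outcome
        "in_flight": found - done - failed,  # discovered but not yet finished
        "failed_urls": [p["url"] for p in status_payload.get("pages", [])
                        if p["status"] != "completed"],
    }
```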
Code examples
The examples below follow the current Crawl API contract.
Start with the raw HTTP request and poll flow.

```bash
curl -X POST "https://api.orbitscraper.com/v1/crawl" \
  -H "x-api-key: ORS_live_1234567890" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "https://docs.example.com",
    "max_pages": 25,
    "depth": 2,
    "include_patterns": ["/blog/**", "/docs/**"],
    "exclude_patterns": ["/account/**"],
    "render_js": false,
    "use_proxy": "auto"
  }'
```
```bash
curl -X GET "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
  -H "x-api-key: ORS_live_1234567890"
```

```bash
curl -X DELETE "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
  -H "x-api-key: ORS_live_1234567890"
```

Response examples
These are the payload shapes the current public Crawl API contract returns.
Queued response
The first response confirms the job was accepted and tells you what to poll.
```json
{
  "request_id": "req_xyz",
  "trace_id": "trace_xyz",
  "job_id": "crawl_123456",
  "status": "queued",
  "crawl_credits_reserved": 25
}
```

Completed response
After polling, this is the final payload your app reads.
```json
{
  "job_id": "crawl_123456",
  "request_id": "req_xyz",
  "trace_id": "trace_xyz",
  "status": "completed",
  "domain": "https://docs.example.com",
  "max_pages": 25,
  "depth": 2,
  "pages_found": 18,
  "pages_completed": 16,
  "pages_failed": 2,
  "credits_reserved": 25,
  "credits_charged": 16,
  "webhook_url": null,
  "pages": [
    {
      "url": "https://docs.example.com/getting-started",
      "status": "completed",
      "title": "Getting started",
      "error_code": null,
      "fetched_at": "2026-03-27T10:40:00.000Z"
    }
  ]
}
```

Operational notes
- The current public contract uses domain, include_patterns, and exclude_patterns rather than singular url or pattern fields.
- Only queued crawl jobs can be cancelled through DELETE /v1/crawl/:jobId.
- The current deployment bills 1 credit per completed page and reserves credits from the max_pages budget up front.
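Putting the endpoints together, a polling loop might look like the sketch below. The fetch function is injected so the flow can be exercised without network access; the terminal statuses come from the response-fields table above, and the interval and timeout defaults are illustrative.

```python
import time
from typing import Callable

# Statuses after which polling should stop, per the response-fields table.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}

def wait_for_crawl(job_id: str, fetch_status: Callable[[str], dict],
                   interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll GET /v1/crawl/:jobId (via fetch_status) until a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch_status(job_id)
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")
```

In production, `fetch_status` would issue the GET request shown in the curl examples; in tests it can be a stub that replays canned payloads.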