LIVE PRODUCT
Crawl API
Crawl a bounded site section and monitor progress through one queue-backed job.
Crawl API handles same-origin discovery, robots-aware crawling, page caps, and progress tracking through one async contract. It is built for teams that need bounded site walks without owning the queueing and crawl-control layer themselves.
Endpoint
POST /v1/crawl
Poll GET /v1/crawl/:jobId and cancel queued jobs with DELETE /v1/crawl/:jobId.
Credits
1 credit per completed page
Credits are reserved up front based on max_pages and finalized as the crawl completes.
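As a worked example of that lifecycle (a sketch only; `settle_credits` is a hypothetical helper, not part of the API): a crawl submitted with `max_pages` 25 reserves 25 credits, and a run that completes 16 pages is charged 16, releasing the other 9.

```python
# Hypothetical helper illustrating the credit lifecycle: credits are
# reserved from max_pages and the final charge matches completed pages.
def settle_credits(max_pages, pages_completed):
    reserved = max_pages                      # 1 credit reserved per budgeted page
    charged = min(pages_completed, reserved)  # 1 credit per completed page
    return {
        "credits_reserved": reserved,
        "credits_charged": charged,
        "credits_released": reserved - charged,
    }

print(settle_credits(25, 16))
# {'credits_reserved': 25, 'credits_charged': 16, 'credits_released': 9}
```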
Output
Progress JSON with page status list
Read crawl progress, job status, and per-page status from the status endpoint.
What it's for
- knowledge-base ingestion from a bounded docs or blog section
- site discovery with include and exclude controls
- content monitoring for site sections that change over time
- crawl progress tracking in internal admin tools
- page-by-page crawl pipelines that need clear billing and cancellation rules
How it works
1. Submit a domain plus crawl limits, patterns, and optional webhook settings.
2. OrbitScraper runs a bounded crawl, tracks discovered pages, and stores per-page status as the job progresses.
3. Poll for progress, read the final page list, or cancel the crawl while it is still queued.
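The three steps above can be sketched in Python with the standard library. The endpoint paths and terminal statuses come from this page; the polling interval, helper names, and API key are placeholders.

```python
import json
import time
import urllib.request

API = "https://api.orbitscraper.com/v1"
HEADERS = {"x-api-key": "ORS_live_1234567890", "Content-Type": "application/json"}
# Terminal job statuses documented in the response fields.
TERMINAL = {"completed", "failed", "cancelled", "expired"}

def _call(method, path, body=None):
    """Send one authenticated request and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(API + path, data=data, headers=HEADERS, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def start_crawl(domain, **limits):
    # Step 1: submit the domain plus crawl limits and patterns.
    return _call("POST", "/crawl", {"domain": domain, **limits})["job_id"]

def wait_for_crawl(job_id, interval=5.0):
    # Step 3: poll until the job reaches a terminal status.
    while True:
        job = _call("GET", "/crawl/" + job_id)
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)

def cancel_crawl(job_id):
    # Cancellation only applies while the job is still queued.
    _call("DELETE", "/crawl/" + job_id)
```

A production client would also back off on rate-limit responses and read `credits_charged` from the final payload.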
Request parameters
These are the fields accepted by the current backend contract for POST /v1/crawl.
| Name | Type | Required | Description |
|---|---|---|---|
| domain | string | Yes | Starting domain or URL. The https:// scheme is assumed when none is provided. |
| max_pages | integer | No | Maximum pages to crawl. Defaults to 50. Range 1-500. |
| depth | integer | No | Link depth from the seed URL. Defaults to 3. Range 1-5. |
| include_patterns | string[] | No | Optional glob-style path allowlist. |
| exclude_patterns | string[] | No | Optional glob-style path denylist. |
| render_js | boolean | No | Use browser-backed rendering for page fetches when true. Defaults to false. |
| use_proxy | `auto` \| `always` \| `never` | No | Proxy strategy for fetches. Defaults to auto. |
| webhook_url | string | No | Optional completion webhook target. |
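The page only specifies "glob-style" patterns, so the exact matching dialect is server-side. As a rough illustration of how the allowlist and denylist interact, Python's `fnmatch` approximates the behaviour (`in_scope` is a hypothetical helper, not part of the API):

```python
from fnmatch import fnmatch

def in_scope(path, include, exclude):
    # An empty allowlist admits everything; otherwise the path must
    # match at least one include pattern and no exclude pattern.
    included = not include or any(fnmatch(path, p) for p in include)
    excluded = any(fnmatch(path, p) for p in exclude)
    return included and not excluded

include = ["/blog/**", "/docs/**"]
exclude = ["/account/**"]

print(in_scope("/docs/quickstart", include, exclude))  # True
print(in_scope("/account/billing", include, exclude))  # False
print(in_scope("/pricing", include, exclude))          # False (not in allowlist)
```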
Response fields
These fields describe the completed payload you read from the current public API contract.
| Name | Type | Description |
|---|---|---|
| job_id | string | Crawl job identifier. |
| request_id | string | Request identifier for tracing the public API call. |
| trace_id | string | Trace identifier attached to the crawl lifecycle. |
| status | string | Current job status such as queued, running, completed, failed, cancelled, or expired. |
| domain | string | Normalized crawl seed URL. |
| max_pages | integer | Configured crawl page budget. |
| depth | integer | Configured crawl depth. |
| pages_found | integer | Pages discovered so far. |
| pages_completed | integer | Pages completed successfully. |
| pages_failed | integer | Pages that failed or were skipped. |
| credits_reserved | integer | Credits reserved from the configured crawl budget. |
| credits_charged | integer | Credits actually charged so far. |
| webhook_url | string \| null | Configured webhook target, if any. |
| error | object \| undefined | Present on failed, cancelled, or expired jobs. |
| pages | array | Per-page status objects with url, status, title, error_code, and fetched_at. |
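A hypothetical consumer of the completed payload might tally the per-page statuses and sanity-check billing against the one-credit-per-completed-page rule. The field names match the table above; `summarize` and the sample job are illustrative.

```python
from collections import Counter

def summarize(job):
    # Count pages by status and confirm the charge matches completions.
    by_status = Counter(p["status"] for p in job["pages"])
    return {
        "status": job["status"],
        "by_status": dict(by_status),
        "billing_consistent": job["credits_charged"] == job["pages_completed"],
    }

job = {
    "status": "completed",
    "pages_completed": 1,
    "credits_charged": 1,
    "pages": [{"url": "https://docs.example.com/getting-started", "status": "completed"}],
}
print(summarize(job))
```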
Code examples
The examples below show the raw HTTP submit, poll, and cancel flow with cURL; the same calls port directly to Python, JavaScript, Java, or PHP.
curl -X POST "https://api.orbitscraper.com/v1/crawl" \
-H "x-api-key: ORS_live_1234567890" \
-H "Content-Type: application/json" \
-d '{
"domain": "https://docs.example.com",
"max_pages": 25,
"depth": 2,
"include_patterns": ["/blog/**", "/docs/**"],
"exclude_patterns": ["/account/**"],
"render_js": false,
"use_proxy": "auto"
}'
curl -X GET "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
-H "x-api-key: ORS_live_1234567890"
curl -X DELETE "https://api.orbitscraper.com/v1/crawl/crawl_123456" \
-H "x-api-key: ORS_live_1234567890"

Response examples
This is the shape you get back from the current public API contract for Crawl API.
Queued response
The first response confirms the job was accepted and tells you what to poll next.
{
"request_id": "req_xyz",
"trace_id": "trace_xyz",
"job_id": "crawl_123456",
"status": "queued",
"credits_reserved": 25
}

Completed response
After polling, this is the final payload shape your app reads.
{
"job_id": "crawl_123456",
"request_id": "req_xyz",
"trace_id": "trace_xyz",
"status": "completed",
"domain": "https://docs.example.com",
"max_pages": 25,
"depth": 2,
"pages_found": 18,
"pages_completed": 16,
"pages_failed": 2,
"credits_reserved": 25,
"credits_charged": 16,
"webhook_url": null,
"pages": [
{
"url": "https://docs.example.com/getting-started",
"status": "completed",
"title": "Getting started",
"error_code": null,
"fetched_at": "2026-03-27T10:40:00.000Z"
}
]
}

Ready to build on Crawl API?
The current backend contract is already live. Use the docs page for request details and the pricing page for credit planning.