Not using Python? See the API Reference.

Ensure you have set your Web Transpose API Key.
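For example, one common way to provide the key is through an environment variable before creating any crawls. This is a minimal sketch; the WEBTRANSPOSE_API_KEY variable name is an assumption here, so check the quickstart for the exact mechanism:

import os

# Assumption: the SDK reads the API key from this environment variable.
os.environ["WEBTRANSPOSE_API_KEY"] = "your-api-key"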

The crawled data is downloaded to crawl.output_dir, which defaults to ./webtranspose-output.
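For instance, assuming you already have a Crawl object (one is constructed in the example further down), you can inspect the configured directory:

# Where this crawl's data will be written
print(crawl.output_dir)  # ./webtranspose-output by default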

Accessing Individual Pages

You can also access the crawled data directly through the Python SDK.

# Get the list of URLs the crawl has visited
visited_urls = crawl.get_visited()
# ['https://www.example.com', 'https://www.example.com/about']

# Fetch the stored details for each visited page
for url in visited_urls:
    page_details = crawl.get_page(url)

Example output

{
    "url": "https://www.example.com",
    "html": <RAW HTML>,
    "text": <TEXT FROM HTML OR PDF>,
    "child_urls": [
        "https://www.example.com/about",
        "https://www.example.com/contact"
    ],
    "parent_urls": [
        "https://www.example.com",
        "https://www.example.com/child-page-1"
    ],
}
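Assuming get_page() returns a dict with the fields shown above, a short sketch that collects the extracted text of every visited page could look like this:

# Map each visited URL to the text extracted from it,
# using the url and text fields from the example output above.
pages_text = {}
for visited_url in crawl.get_visited():
    page = crawl.get_page(visited_url)
    pages_text[page["url"]] = page["text"]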

Download Crawl Data to Disk

This downloads all of the crawl data to disk. The destination is crawl.output_dir, which defaults to ./webtranspose-output.

crawl.download()
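Once the download finishes, the files live under crawl.output_dir. The exact file layout is not documented here, so this sketch simply walks the directory and prints every file it finds:

from pathlib import Path

# List every file the download wrote to disk
for path in Path(crawl.output_dir).rglob("*"):
    if path.is_file():
        print(path)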

Get a Page's Child URLs

import webtranspose as webt

url = "https://www.webtranspose.com"
crawl = webt.Crawl(
    url,
    max_pages=5,
)
# List the child URLs discovered on this page
child_urls = crawl.get_page_child_urls(url)
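Because every page reports its child URLs, you can walk the whole crawl graph from the root. Here is a sketch of a breadth-first traversal, assuming the crawl has already run so each page's children are known:

from collections import deque

seen = {url}
queue = deque([url])
while queue:
    current = queue.popleft()
    # get_page_child_urls() acts as the adjacency list for each page
    for child in crawl.get_page_child_urls(current):
        if child not in seen:
            seen.add(child)
            queue.append(child)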

Get Visited Pages

visited_urls = crawl.get_visited()
# ['https://www.example.com', 'https://www.example.com/about']

Get Ignored Pages

ignored_urls = crawl.get_ignored()
# ['https://www.example.com', 'https://www.example.com/about']

Get Queued Pages

queued_urls = crawl.get_queued()
# ['https://www.example.com', 'https://www.example.com/about']

Get Banned URLs

banned_urls = crawl.get_banned()
# ['https://www.example.com', 'https://www.example.com/about']
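Taken together, these four methods describe the full state of a crawl. A small sketch that summarizes them in one place:

# Count the URLs in each of the crawl's bookkeeping buckets
status = {
    "visited": len(crawl.get_visited()),
    "ignored": len(crawl.get_ignored()),
    "queued": len(crawl.get_queued()),
    "banned": len(crawl.get_banned()),
}
print(status)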