Not using Python? See the API Reference.
The crawled data is downloaded to crawl.output_dir, which defaults to ./webtranspose-output.
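For example, you can inspect the output directory directly on the Crawl object. This is a minimal sketch; it assumes output_dir is a plain attribute on the Crawl object, as the path above suggests, and uses the same constructor arguments as the examples below.

import webtranspose as webt

crawl = webt.Crawl(
    "https://www.example.com",
    max_pages=5,
)

# crawl.output_dir is the directory where crawled data is written;
# it defaults to ./webtranspose-output
print(crawl.output_dir)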
Accessing Individual Pages
You can also access the crawled data directly through the Python SDK.
# Get the list of visited URLs
visited_urls = crawl.get_visited()
visited_urls
# ['https://www.example.com', 'https://www.example.com/about']

# Fetch the stored page details for each visited URL
for url in visited_urls:
    page_details = crawl.get_page(url)
Example output
{
    "url": "https://www.example.com",
    "html": <RAW HTML>,
    "text": <TEXT FROM HTML OR PDF>,
    "child_urls": [
        "https://www.example.com/about",
        "https://www.example.com/contact"
    ],
    "parent_urls": [
        "https://www.example.com",
        "https://www.example.com/child-page-1"
    ]
}
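For example, you could combine get_visited() and get_page() to save each page's extracted text to disk. This is a minimal sketch; it assumes the crawl has already visited pages and that the page dictionary contains the "text" key shown in the example output above. The page-text directory and the filename scheme are illustrative only.

import os
import webtranspose as webt

crawl = webt.Crawl(
    "https://www.example.com",
    max_pages=5,
)

os.makedirs("page-text", exist_ok=True)
for url in crawl.get_visited():
    page_details = crawl.get_page(url)
    # Build a filesystem-safe filename from the page URL (illustrative only)
    filename = url.replace("https://", "").replace("/", "_") + ".txt"
    with open(os.path.join("page-text", filename), "w") as f:
        f.write(page_details["text"])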
Download Crawl Data to Disk
This downloads all of the crawled data to disk. The output location defaults to ./webtranspose-output.
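A minimal sketch of the call; the download() method name is an assumption based on this section's title and may differ in your SDK version.

import webtranspose as webt

crawl = webt.Crawl(
    "https://www.webtranspose.com",
    max_pages=5,
)

# Assumption: download() writes all crawled data to crawl.output_dir,
# which defaults to ./webtranspose-output
crawl.download()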
Get a Page's Child URLs
import webtranspose as webt

url = "https://www.webtranspose.com"
crawl = webt.Crawl(
    url,
    max_pages=5,
)

# Child URLs discovered on the given page
child_urls = crawl.get_page_child_urls(url)
Get Visited Pages
visited_urls = crawl.get_visited()
visited_urls
# ['https://www.example.com', 'https://www.example.com/about']
Get Ignored Pages
ignored_urls = crawl.get_ignored()
ignored_urls
# ['https://www.example.com', 'https://www.example.com/about']
Get Queued Pages
queued_urls = crawl.get_queued()
queued_urls
# ['https://www.example.com', 'https://www.example.com/about']
Get Banned URLs
banned_urls = crawl.get_banned()
banned_urls
# ['https://www.example.com', 'https://www.example.com/about']
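Taken together, these accessors let you summarize the state of a crawl. A minimal sketch, assuming a crawl that has already started so the lists are populated:

import webtranspose as webt

crawl = webt.Crawl(
    "https://www.webtranspose.com",
    max_pages=5,
)

# Print a quick status summary of the crawl queue
print(f"visited: {len(crawl.get_visited())}")
print(f"ignored: {len(crawl.get_ignored())}")
print(f"queued:  {len(crawl.get_queued())}")
print(f"banned:  {len(crawl.get_banned())}")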