Commit df9b457

Merge pull request #64 from ScrapeGraphAI/sitemap-endpoint-integration

feat: add sitemap endpoint

2 parents: d0a10e5 + e07cd76

File tree

16 files changed: +1816 −12 lines changed

.agent/README.md — 2 additions, 0 deletions

@@ -212,6 +212,7 @@ Both SDKs support the following endpoints:

  | SmartScraper | ✅ | ✅ | AI-powered data extraction |
  | SearchScraper | ✅ | ✅ | Multi-website search extraction |
  | Markdownify | ✅ | ✅ | HTML to Markdown conversion |
+ | Sitemap | ❌ | ✅ | Sitemap URL extraction |
  | SmartCrawler | ✅ | ✅ | Sitemap generation & crawling |
  | AgenticScraper | ✅ | ✅ | Browser automation |
  | Scrape | ✅ | ✅ | Basic HTML extraction |

@@ -259,6 +260,7 @@ Both SDKs support the following endpoints:

  - `searchScraper.js`
  - `crawl.js`
  - `markdownify.js`
+ - `sitemap.js`
  - `agenticScraper.js`
  - `scrape.js`
  - `scheduledJobs.js`

scrapegraph-js/README.md — 36 additions, 0 deletions

@@ -451,6 +451,27 @@ const url = 'https://scrapegraphai.com/';

### Sitemap

Extract all URLs from a website's sitemap. The endpoint automatically discovers the sitemap from robots.txt or from common sitemap locations.

```javascript
import { sitemap } from 'scrapegraph-js';

const apiKey = 'your-api-key';
const websiteUrl = 'https://example.com';

(async () => {
  try {
    const response = await sitemap(apiKey, websiteUrl);
    console.log('Total URLs found:', response.urls.length);
    console.log('URLs:', response.urls);
  } catch (error) {
    console.error('Error:', error);
  }
})();
```

@@ -688,6 +709,21 @@ Starts a crawl job to extract structured data from a website and its linked pages.

### Sitemap

#### `sitemap(apiKey, websiteUrl, options)`

Extracts all URLs from a website's sitemap. The endpoint automatically discovers the sitemap from robots.txt or from common sitemap locations.

**Parameters:**
- `apiKey` (string): Your ScrapeGraph AI API key
- `websiteUrl` (string): The URL of the website to extract the sitemap from
- `options` (object, optional): Additional options (see the sketch below)
  - `mock` (boolean): Override mock mode for this request

**Returns:** Promise resolving to an object containing:
- `urls` (array): List of URLs extracted from the sitemap
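A minimal sketch of the optional third argument, using the documented `mock` flag. The key and URL are placeholders, and it assumes mock mode short-circuits the live request so you can inspect the response shape:

```javascript
import { sitemap } from 'scrapegraph-js';

// Assumption: with `mock: true` the SDK returns a stubbed response instead
// of hitting the live endpoint — handy for checking `response.urls` cheaply.
const response = await sitemap('your-api-key', 'https://example.com', { mock: true });
console.log(Array.isArray(response.urls)); // expected: true
```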
New file (sitemap examples README) — 128 additions, 0 deletions

@@ -0,0 +1,128 @@

# Sitemap Examples

This directory contains examples demonstrating how to use the `sitemap` endpoint to extract URLs from website sitemaps.

## 📁 Examples

### 1. Basic Sitemap Extraction (`sitemap_example.js`)

Demonstrates the basic usage of the sitemap endpoint:
- Extract all URLs from a website's sitemap
- Display the URLs
- Save the URLs to a text file
- Save the complete response as JSON

**Usage:**
```bash
node sitemap_example.js
```

**What it does:**
1. Calls the sitemap API with a target website URL
2. Retrieves all URLs from the sitemap
3. Displays the first 10 URLs in the console
4. Saves all URLs to `sitemap_urls.txt`
5. Saves the full response to `sitemap_urls.json`

### 2. Advanced: Sitemap + SmartScraper (`sitemap_with_smartscraper.js`)

Shows how to combine sitemap extraction with smartScraper for batch processing:
- Extract sitemap URLs
- Filter URLs based on patterns (e.g., blog posts)
- Scrape the selected URLs with smartScraper
- Display the results and a summary

**Usage:**
```bash
node sitemap_with_smartscraper.js
```

**What it does:**
1. Extracts all URLs from a website's sitemap
2. Filters the URLs (example: only blog posts or specific sections)
3. Scrapes each filtered URL using smartScraper
4. Extracts structured data from each page
5. Displays a summary of successful and failed scrapes

**Use Cases:**
- Bulk content extraction from blogs
- E-commerce product catalog scraping
- News article aggregation
- Content migration and archival

## 🔑 Setup

Before running the examples, make sure you have:

1. **API Key**: Set your ScrapeGraph AI API key as an environment variable (see the optional guard sketch after this list):
   ```bash
   export SGAI_APIKEY="your-api-key-here"
   ```

   Or create a `.env` file in the project root:
   ```
   SGAI_APIKEY=your-api-key-here
   ```

2. **Dependencies**: Install the required packages:
   ```bash
   npm install
   ```
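Optionally, a fail-fast guard at the top of either example (illustrative, not part of the shipped code) makes a missing key obvious before any request is sent:

```javascript
import 'dotenv/config';

// Optional guard: exit early with a clear message when SGAI_APIKEY is unset,
// rather than failing later inside the API call.
if (!process.env.SGAI_APIKEY) {
  console.error('SGAI_APIKEY is not set. Export it or add it to your .env file.');
  process.exit(1);
}
```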
## 📊 Expected Output

### Basic Sitemap Example Output:
```
🗺️ Extracting sitemap from: https://example.com/
⏳ Please wait...

✅ Sitemap extracted successfully!
📊 Total URLs found: 150

📄 First 10 URLs:
  1. https://example.com/
  2. https://example.com/about
  3. https://example.com/products
  ...

💾 URLs saved to: sitemap_urls.txt
💾 JSON saved to: sitemap_urls.json
```

### Advanced Example Output:
```
🗺️ Step 1: Extracting sitemap from: https://example.com/
⏳ Please wait...

✅ Sitemap extracted successfully!
📊 Total URLs found: 150

🎯 Selected 3 URLs to scrape:
  1. https://example.com/blog/post-1
  2. https://example.com/blog/post-2
  3. https://example.com/blog/post-3

🤖 Step 2: Scraping selected URLs...

📄 Scraping (1/3): https://example.com/blog/post-1
  ✅ Success
...

📈 Summary:
  ✅ Successful: 3
  ❌ Failed: 0
  📊 Total: 3
```

## 💡 Tips

1. **Rate Limiting**: When scraping multiple URLs, add delays between requests to avoid rate limiting (see the sketch after this list)
2. **Error Handling**: Always use try/catch blocks to handle API errors gracefully
3. **Filtering**: Use URL patterns to filter specific sections (e.g., `/blog/`, `/products/`)
4. **Batch Size**: Start with a small batch to test before processing hundreds of URLs
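A sketch tying tips 1–4 together. The `delayMs` value, the `/blog/` pattern, and the 5-URL batch are illustrative choices, not part of the shipped examples:

```javascript
import { sitemap, smartScraper } from 'scrapegraph-js';
import 'dotenv/config';

const apiKey = process.env.SGAI_APIKEY;
const delayMs = 1000; // Tip 1: pause between requests to stay under rate limits

const { urls } = await sitemap(apiKey, 'https://example.com');

// Tip 3: keep only one section; Tip 4: start with a small batch.
const batch = urls.filter((u) => u.includes('/blog/')).slice(0, 5);

for (const url of batch) {
  try {
    // Tip 2: wrap each call so one failure doesn't abort the whole run.
    const res = await smartScraper(apiKey, url, 'Extract the page title');
    console.log(url, res.result);
  } catch (err) {
    console.error(url, err.message);
  }
  await new Promise((resolve) => setTimeout(resolve, delayMs));
}
```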
## 🔗 Related Documentation

- [Sitemap API Documentation](../../README.md#sitemap)
- [SmartScraper Documentation](../../README.md#smart-scraper)
- [ScrapeGraph AI API Docs](https://docs.scrapegraphai.com)
New file: `sitemap_example.js` — 72 additions, 0 deletions

@@ -0,0 +1,72 @@

```javascript
import { sitemap } from 'scrapegraph-js';
import fs from 'fs';
import 'dotenv/config';

/**
 * Example: Extract sitemap URLs from a website
 *
 * This example demonstrates how to use the sitemap endpoint to extract
 * all URLs from a website's sitemap.xml file.
 */

// Get the API key from the environment
const apiKey = process.env.SGAI_APIKEY;

// Target website URL
const url = 'https://scrapegraphai.com/';

console.log('🗺️ Extracting sitemap from:', url);
console.log('⏳ Please wait...\n');

try {
  // Call the sitemap endpoint
  const response = await sitemap(apiKey, url);

  console.log('✅ Sitemap extracted successfully!');
  console.log(`📊 Total URLs found: ${response.urls.length}\n`);

  // Display the first 10 URLs
  console.log('📄 First 10 URLs:');
  response.urls.slice(0, 10).forEach((url, index) => {
    console.log(`  ${index + 1}. ${url}`);
  });

  if (response.urls.length > 10) {
    console.log(`  ... and ${response.urls.length - 10} more URLs`);
  }

  // Save the complete list to a text file
  saveUrlsToFile(response.urls, 'sitemap_urls.txt');

  // Save the full response as JSON for programmatic use
  saveUrlsToJson(response, 'sitemap_urls.json');
} catch (error) {
  console.error('❌ Error:', error.message);
  process.exit(1);
}

/**
 * Helper function to save URLs to a text file
 */
function saveUrlsToFile(urls, filename) {
  try {
    const content = urls.join('\n');
    fs.writeFileSync(filename, content);
    console.log(`\n💾 URLs saved to: ${filename}`);
  } catch (err) {
    console.error('❌ Error saving file:', err.message);
  }
}

/**
 * Helper function to save the complete response as JSON
 */
function saveUrlsToJson(response, filename) {
  try {
    fs.writeFileSync(filename, JSON.stringify(response, null, 2));
    console.log(`💾 JSON saved to: ${filename}`);
  } catch (err) {
    console.error('❌ Error saving JSON:', err.message);
  }
}
```
New file: `sitemap_with_smartscraper.js` — 106 additions, 0 deletions

@@ -0,0 +1,106 @@

```javascript
import { sitemap, smartScraper } from 'scrapegraph-js';
import 'dotenv/config';

/**
 * Advanced Example: Extract sitemap and scrape selected URLs
 *
 * This example demonstrates how to combine the sitemap endpoint
 * with smartScraper to extract structured data from multiple pages.
 */

const apiKey = process.env.SGAI_APIKEY;

// Configuration
const websiteUrl = 'https://scrapegraphai.com/';
const maxPagesToScrape = 3; // Limit the number of pages to scrape
const userPrompt = 'Extract the page title and main heading';

console.log('🗺️ Step 1: Extracting sitemap from:', websiteUrl);
console.log('⏳ Please wait...\n');

try {
  // Step 1: Get all URLs from the sitemap
  const sitemapResponse = await sitemap(apiKey, websiteUrl);

  console.log('✅ Sitemap extracted successfully!');
  console.log(`📊 Total URLs found: ${sitemapResponse.urls.length}\n`);

  // Step 2: Filter the URLs (example: only blog posts)
  const filteredUrls = sitemapResponse.urls
    .filter(url => url.includes('/blog/') || url.includes('/post/'))
    .slice(0, maxPagesToScrape);

  if (filteredUrls.length === 0) {
    console.log('ℹ️ No blog URLs found, using the first 3 URLs instead');
    filteredUrls.push(...sitemapResponse.urls.slice(0, maxPagesToScrape));
  }

  console.log(`🎯 Selected ${filteredUrls.length} URLs to scrape:`);
  filteredUrls.forEach((url, index) => {
    console.log(`  ${index + 1}. ${url}`);
  });

  // Step 3: Scrape each selected URL
  console.log('\n🤖 Step 2: Scraping selected URLs...\n');

  const results = [];

  for (let i = 0; i < filteredUrls.length; i++) {
    const url = filteredUrls[i];
    console.log(`📄 Scraping (${i + 1}/${filteredUrls.length}): ${url}`);

    try {
      const scrapeResponse = await smartScraper(apiKey, url, userPrompt);

      results.push({
        url: url,
        data: scrapeResponse.result,
        status: 'success'
      });

      console.log('  ✅ Success');

      // Add a small delay between requests to avoid rate limiting
      if (i < filteredUrls.length - 1) {
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    } catch (error) {
      console.log(`  ❌ Failed: ${error.message}`);
      results.push({
        url: url,
        error: error.message,
        status: 'failed'
      });
    }
  }

  // Step 4: Display the results
  console.log('\n📊 Scraping Results:\n');
  results.forEach((result, index) => {
    console.log(`${index + 1}. ${result.url}`);
    if (result.status === 'success') {
      console.log('   Status: ✅ Success');
      console.log('   Data:', JSON.stringify(result.data, null, 2));
    } else {
      console.log('   Status: ❌ Failed');
      console.log('   Error:', result.error);
    }
    console.log('');
  });

  // Summary
  const successCount = results.filter(r => r.status === 'success').length;
  console.log('📈 Summary:');
  console.log(`  ✅ Successful: ${successCount}`);
  console.log(`  ❌ Failed: ${results.length - successCount}`);
  console.log(`  📊 Total: ${results.length}`);
} catch (error) {
  console.error('❌ Error:', error.message);
  process.exit(1);
}
```
