Web Page Scrape Block
The Web Page Scrape block lets you extract content from any public web page. It’s perfect for pulling in articles, documentation, product info, or any web-based data you want to process in your workflow.
Learning Objectives:
- Understand how to configure the Web Page Scrape Block.
- Explore best practices for effective web scraping.
Configuration Options:
- URL:
- Specifies the web page to be scraped.
- Ensure the URL is valid and accessible. This input supports Jinja templating, allowing for dynamic URL construction based on workflow state.
- Tip: Always verify the URL's accessibility to avoid errors during the scraping process.
- Include Selectors:
- Use Include Selectors to extract just the parts of a web page you want, making your workflow results cleaner, more focused, and easier to process.
- Provide a comma-separated list of classes, IDs, or tags to include. Prefix an ID with "#" and a class with "." (for example: #main, .article-body).
Why Use Include Selectors?
- Focus on Relevant Content:
- Many web pages have lots of extra elements—menus, ads, footers, sidebars, etc. If you only want the main article, a product description, or a specific table, you can use include selectors to grab just those elements.
- Cleaner Data for Downstream Blocks:
- By including only what you need, you reduce noise and make it easier for LLMs or other blocks to process the data accurately.
- Stay Within Token Limits:
- Limiting the scrape to specific sections helps keep the output under the 128,000 token limit, which is important for processing in LLMs.
- Exclude Selectors:
- Allows you to exclude specific elements from the scraped content, cutting down on unnecessary information that may muddle results.
- Provide a comma-separated list of classes, IDs, or tags to exclude. Prefix an ID with "#" and a class with ".".
- Tip: Use this feature to remove unwanted elements from the HTML output, such as ads or navigation bars. For more on fine-tuning your scrapes, check out this Scout blog, which covers the strengths and weaknesses of each text extractor and how to properly exclude classes and IDs.
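To make the selector syntax concrete, here is a minimal sketch of how a comma-separated selector string could be split into IDs, classes, and tags. It only illustrates the "#" and "." prefix convention described above; the Selector type and parse_selectors function are hypothetical names, not part of the block's actual implementation.

```python
from typing import List, NamedTuple


class Selector(NamedTuple):
    kind: str  # "id", "class", or "tag"
    name: str


def parse_selectors(raw: str) -> List[Selector]:
    """Split a comma-separated selector string into typed selectors."""
    selectors = []
    for part in raw.split(","):
        token = part.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        if token.startswith("#"):
            selectors.append(Selector("id", token[1:]))
        elif token.startswith("."):
            selectors.append(Selector("class", token[1:]))
        else:
            selectors.append(Selector("tag", token.lower()))
    return selectors


# Example values for the Include Selectors and Exclude Selectors fields.
include = parse_selectors("#main, .article-body, article")
exclude = parse_selectors(".ad-banner, nav, #footer")
print(include)
print(exclude)
```

Reading a field value this way before running the workflow is a quick check that every ID carries its "#" and every class its ".".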
Outputs
- The block outputs the HTML content of the web page, which can be processed further by later blocks in the workflow.
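If you want a rough idea of whether a scraped page is likely to fit under the 128,000 token limit mentioned above, a quick estimate like the following can help. This is only a sketch: the ratio of four characters per token is an assumed rule of thumb for English text, not the tokenizer the platform actually uses, and the function names are hypothetical.

```python
TOKEN_LIMIT = 128_000  # limit stated in this guide
CHARS_PER_TOKEN = 4    # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count / CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_token_limit(text: str, limit: int = TOKEN_LIMIT) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit


scraped = "<p>Example scraped content</p>" * 1000
print(estimate_tokens(scraped), fits_token_limit(scraped))
```

If the estimate is close to or over the limit, narrowing the Include Selectors or adding Exclude Selectors is the first thing to try.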
Micro-Challenge:
- Use the Web Page Scrape block to extract just the main article text from a news or blog page—removing navigation bars, sidebars, ads, and footers. Try the template below to experiment with returning clean, focused content in your scrape.
Instructions:
- Pick a news or blog article (e.g., from BBC, Medium, or your favorite tech blog).
- Paste the article’s URL into the Web Page Scrape block’s URL field or add a second input field titled "URL". This allows you to ask a question in the message field, and the tool will look for an answer on the specified webpage. If you added the second input, the URL to scrape can be dynamically inserted using Jinja: {{inputs.url}}
- Inspect the page (right-click → “Inspect” in your browser) to find the CSS selector for the main article content (e.g., .article-body, #main, .post-content).
- Enter this selector in the Include Selectors field of the Web Page Scrape block.
- Run the workflow and check the output.
- Did you get just the article text, without menus, ads, or comments?
- Bonus: Try a different article and selector to see how results change.
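Before running the workflow, you can try out a selector choice locally. The sketch below uses only Python's standard library to approximate what include and exclude selectors do on a small sample page. It is an illustration, not the block's implementation: it collects text rather than HTML, supports only simple "#id", ".class", and tag selectors, and assumes well-formed HTML with matching closing tags. SelectorPreview, matches, preview, and SAMPLE_HTML are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(tag, attrs, selectors):
    """Return True if the element matches any "#id", ".class", or tag selector."""
    attr_map = dict(attrs)
    classes = (attr_map.get("class") or "").split()
    for sel in selectors:
        if sel.startswith("#"):
            if attr_map.get("id") == sel[1:]:
                return True
        elif sel.startswith("."):
            if sel[1:] in classes:
                return True
        elif tag == sel.lower():
            return True
    return False


class SelectorPreview(HTMLParser):
    """Collect text inside included elements, skipping excluded ones."""

    def __init__(self, include, exclude):
        super().__init__()
        self.include = include
        self.exclude = exclude
        self.include_depth = 0  # > 0 while inside an included element
        self.exclude_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.exclude_depth or matches(tag, attrs, self.exclude):
            self.exclude_depth += 1
        elif self.include_depth or matches(tag, attrs, self.include):
            self.include_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.exclude_depth:
            self.exclude_depth -= 1
        elif self.include_depth:
            self.include_depth -= 1

    def handle_data(self, data):
        if self.include_depth and not self.exclude_depth and data.strip():
            self.chunks.append(data.strip())


SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <div class="article-body">
    <h1>Title</h1>
    <p>First paragraph.</p>
    <div class="ad-banner">Buy now!</div>
    <p>Second paragraph.</p>
  </div>
  <footer id="footer">Copyright</footer>
</body></html>
"""


def preview(html, include, exclude=()):
    """Return the text chunks kept by the given include/exclude selectors."""
    parser = SelectorPreview(list(include), list(exclude))
    parser.feed(html)
    parser.close()
    return parser.chunks


print(preview(SAMPLE_HTML, [".article-body"], [".ad-banner"]))
# prints ['Title', 'First paragraph.', 'Second paragraph.']
```

In this sample, including .article-body drops the navigation and footer, and excluding .ad-banner removes the ad nested inside the article. Real pages are often messier, so treat the result as a rough guide and confirm with an actual run of the block.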