Webeater

Quick out of the box web content extractor in Python aimed at AI agents.

WebEater (weat)

WebEater is a web content extraction tool designed to fetch and process web pages.
It is made for developers and researchers who need to extract structured data from web pages efficiently.
The tool goes straight to the point, focusing on extracting text and structured data from web pages, while providing some additional configuration options and hints for better effectiveness.

Its main purpose is to serve as a go-to component that works out of the box for most general use cases.

As it’s currently at an early stage, it may not cover all edge cases or complex scenarios.
We welcome contributions and feedback to help improve its capabilities.

Main Features

  • Fetches web pages and extracts text content into Markdown format.
  • Returns clean, plain text or a JSON object, optionally including lists of the images and links found on the page.
  • Handles JavaScript-heavy pages using Selenium and BeautifulSoup.
  • Can be used both as a library and a command-line tool (CLI).

Quick Start (CLI)

To use WebEater from the command line, first install it using pip:

pip install webeater

Then, you can run it with a URL using the weat CLI tool:

weat https://example.com

This will fetch the content of the page and print the extracted text to the console.

CLI Options

You can customize the behavior of WebEater using various command-line options:

  • url (positional): URL to fetch content from. If omitted, WebEater starts an interactive prompt.
  • -c, --config FILE (default: weat.json): Config file to use.
  • --hints FILE [FILE ...]: Additional hint files to load (space-separated paths).
  • --debug: Enable debug logging.
  • --silent: Silent mode; suppresses debug/info messages and prints only results or errors, so the tool can be called from scripts or subprocesses.
  • --json: Return content as JSON instead of plain text.
  • --content-only: Return only the main extracted content (skip extracting links and images).

Examples:

# Basic usage
webeater https://example.com

# JSON output and content-only
webeater --json --content-only https://example.com

# Using a custom config and multiple hint files
webeater -c weat.json --hints hints/news.json hints/sports.json https://example.com

Interactive mode (when no URL is provided):

  • Enter a URL when prompted to fetch content.
  • Prefix shortcuts per request (see the example after this list):
    • j! → return JSON
    • c! → content only
    • jc! or cj! → JSON + content only
  • q → quit the interactive session
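
For example, assuming the prefix is typed before the URL at the prompt (the exact prompt format may differ), a request like the following would return JSON containing only the main content:

jc! https://example.com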

Notes:

  • URLs must start with http:// or https://.
  • In silent mode, only the result or an error line (prefixed with “Error:”) is printed, which makes the tool easy to call from scripts, as in the sketch below.
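
Because silent mode keeps stdout limited to the result (or an “Error:” line), the CLI is easy to drive from another program. Below is a minimal sketch, assuming the weat executable is on your PATH and that --silent --json prints a single JSON document:

import json
import subprocess

def fetch_page(url: str) -> dict:
    # Run the weat CLI in silent JSON mode and capture its output.
    result = subprocess.run(
        ["weat", "--silent", "--json", url],
        capture_output=True,
        text=True,
    )
    output = result.stdout.strip()
    # In silent mode, failures are reported as a line prefixed with "Error:".
    if result.returncode != 0 or output.startswith("Error:"):
        raise RuntimeError(f"weat failed: {output or result.stderr.strip()}")
    return json.loads(output)

if __name__ == "__main__":
    data = fetch_page("https://example.com")
    print(data)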

Quick Start (Python)

To use WebEater, first install it using pip:

pip install webeater

You can then import the Webeater class and create an instance of it.
The engine will automatically load the necessary configuration and provide methods for extracting web content.

Note that it must be loaded within an async context.

Below is a minimal example:

import asyncio
from webeater import Webeater

async def main():
    weat = await Webeater.create()
    content = await weat.get(url="https://example.com")
    print(content)

asyncio.run(main())
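
Because Webeater.create() sets up the extraction engine once, a single instance can be reused for several pages. The loop below is a minimal sketch, assuming weat.get() can simply be awaited repeatedly on the same instance:

import asyncio
from webeater import Webeater

async def main():
    # Create the engine once and reuse it for every URL.
    weat = await Webeater.create()
    for url in ["https://example.com", "https://example.org"]:
        content = await weat.get(url=url)
        print(f"--- {url} ---")
        print(content)

asyncio.run(main())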

Help and Contributions

For questions or discussions about changes and new features, please start a new Discussion in the Webeater GitHub repository.

If you find bugs or want to contribute, please open an Issue.

Develop with Source

To develop with WebEater from source code, you can clone the repository at:

https://github.com/tiagrib/webeater.git

then navigate to the project directory and install the required dependencies:

pip install -r requirements.txt

The current code was tested with Python 3.12.3, though other versions may work.

Configuration and Advanced Documentation

WebEater uses a configuration file to manage its settings. The configuration file is typically located at config/weat.yaml.

You can customize the settings in this file to suit your needs, such as specifying the default user agent, timeout settings, and other parameters.

For more detailed documentation on configuration options and advanced usage, please refer to the Hints Documentation.

Tiago Ribeiro
AI Technology & Product Consulting

Eclectic scientist and engineer striving to breathe the Illusion of Life into autonomous characters