CLI Reference

The BookWyrm client includes a comprehensive command-line interface for all text processing operations. The documentation below is automatically generated from the CLI code.

bookwyrm

BookWyrm Client CLI - Accelerate RAG and AI agent development.

The BookWyrm client provides powerful text processing capabilities through a simple CLI, making it easy to build sophisticated document analysis and citation systems.

Key Capabilities

  • Citation Finding - Find relevant citations for questions in text chunks
  • Text Summarization - Generate summaries with custom Pydantic models
  • Phrasal Analysis - Extract phrases and chunks from text using NLP
  • PDF Extraction - Extract structured text data from PDFs with OCR
  • File Classification - Intelligently classify files by format and type
  • Streaming Support - Real-time progress updates for all operations

Environment Variables

  • BOOKWYRM_API_KEY - Your BookWyrm API key (required)
  • BOOKWYRM_API_URL - Base URL (default: https://api.bookwyrm.ai:443)
  • BOOKWYRM_PDF_API_URL - PDF API URL (falls back to BOOKWYRM_API_URL)

Get your API key at: https://api.bookwyrm.ai
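
The fallback rules for these variables can be sketched in Python. This is an illustration only, not the client's actual code: `resolve_settings` and its return shape are hypothetical, and letting `--base-url` take precedence over `BOOKWYRM_PDF_API_URL` is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_API_URL = "https://api.bookwyrm.ai:443"  # default listed above


def resolve_settings(
    env: Mapping[str, str] = os.environ,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
) -> dict:
    """Hypothetical helper: flags override environment variables, which override defaults."""
    api_url = base_url or env.get("BOOKWYRM_API_URL") or DEFAULT_API_URL
    # The PDF URL falls back to the general API URL when it is not set.
    # Assumption: an explicit base_url wins over BOOKWYRM_PDF_API_URL.
    pdf_api_url = base_url or env.get("BOOKWYRM_PDF_API_URL") or api_url
    key = api_key or env.get("BOOKWYRM_API_KEY")
    if not key:
        raise ValueError("BOOKWYRM_API_KEY is required")
    return {"api_url": api_url, "pdf_api_url": pdf_api_url, "api_key": key}
```

Passing a plain dictionary as `env` makes the precedence easy to check without touching the real environment.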

Usage:

 bookwyrm [OPTIONS] COMMAND [ARGS]...

Options:

  --version             Show version and exit
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.

cite

Find citations for questions in text chunks.

This command searches through text chunks to find relevant citations that answer questions. It supports both local JSONL files and remote URLs, and can handle single or multiple questions.

Input Format

The JSONL file should contain text chunks in this format:

{"text": "chunk text", "start_char": 0, "end_char": 10}
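
A minimal sketch of producing input in this shape from a list of chunk strings. It assumes that `end_char` is exclusive (consistent with the 10-character example above) and that consecutive chunks are separated by a single newline in the source text; `chunks_to_jsonl` is a hypothetical helper, not part of the client.

```python
import json
from typing import Iterable, List


def chunks_to_jsonl(texts: Iterable[str], separator: str = "\n") -> List[str]:
    """Serialize consecutive chunks as JSONL lines with running character offsets."""
    lines = []
    position = 0
    for text in texts:
        record = {
            "text": text,
            "start_char": position,
            "end_char": position + len(text),  # exclusive end (assumption)
        }
        lines.append(json.dumps(record, ensure_ascii=False))
        # Skip over the separator that sat between chunks in the source text.
        position += len(text) + len(separator)
    return lines
```

Writing the returned lines to a file, one per line, gives a JSONL file that can be passed as the `JSONL_INPUT` argument.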

Question Input Methods

  1. Single question: Use --question "Your question here"
  2. Multiple questions: Use --question multiple times: --question "Q1" --question "Q2"
  3. Questions file: Use --questions-file questions.txt with one question per line
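
When driving the CLI from a script, the question options above can be assembled programmatically. The sketch below only builds the argument list from the flags documented here; `build_cite_command` is a hypothetical helper, and treating the questions file as an alternative to inline questions is an assumption.

```python
from typing import List, Optional, Sequence


def build_cite_command(
    jsonl_input: str,
    questions: Sequence[str] = (),
    questions_file: Optional[str] = None,
) -> List[str]:
    """Assemble a `bookwyrm cite` argument list from the documented flags."""
    if not questions and questions_file is None:
        raise ValueError("provide at least one question or a questions file")
    command = ["bookwyrm", "cite"]
    if questions_file is not None:
        command += ["--questions-file", questions_file]
    else:
        for question in questions:
            # --question may be repeated, once per question.
            command += ["--question", question]
    command.append(jsonl_input)
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` without shell quoting concerns.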

Examples

# Single question (uses swift model by default)
bookwyrm cite --question "What is machine learning?" ml_chunks.jsonl

# High-quality citation finding
bookwyrm cite --question "What is AI?" data.jsonl --model-strength wise

# Multiple questions with advanced reasoning
bookwyrm cite --question "What is AI?" --question "How does ML work?" data.jsonl --model-strength clever

# Questions from file with maximum sophistication
bookwyrm cite --questions-file questions.txt data.jsonl --model-strength brainiac -o citations.json

# From URL with multiple questions
bookwyrm cite --question "Q1" --question "Q2" --url https://example.com/chunks.jsonl

# Limit processing with smart model
bookwyrm cite --question "Question" data.jsonl --start 10 --limit 50 --model-strength smart

Model Strength Levels

  • swift: Fast processing for quick results
  • smart: Intelligent analysis with good quality
  • clever: Advanced reasoning capabilities
  • wise: High-quality analysis for important content
  • brainiac: Maximum sophistication for complex questions

Output Formats

  • JSON (non-streaming): Array of citation objects
  • JSONL (streaming): One citation per line, written as each is found
  • For multiple questions, citations include the question index and text

Usage:

 cite [OPTIONS] [JSONL_INPUT]

Options:

  [JSONL_INPUT]                   Path to JSONL file containing text chunks
                                  (optional if using --file or --url)
  -q, --question TEXT             Question to find citations for (can be used
                                  multiple times)
  --questions-file PATH           File containing questions, one per line
  --url TEXT                      URL to JSONL file (alternative to file path)
  --file PATH                     JSONL file to read chunks from
  -o, --output PATH               Output file for citations (JSON for
                                  non-streaming, JSONL for streaming)
  --start INTEGER                 Start chunk index  [default: 0]
  --limit INTEGER                 Limit number of chunks to process
  --max-tokens INTEGER            Maximum tokens per chunk  [default: 1000]
  --model-strength [swift|smart|clever|wise|brainiac]
                                  Model strength level: swift (fast), smart
                                  (intelligent), clever (advanced), wise
                                  (high-quality), brainiac (maximum
                                  sophistication)  [default: swift]
  --base-url TEXT                 Base URL of the BookWyrm API (overrides
                                  BOOKWYRM_API_URL env var)
  --api-key TEXT                  API key for authentication (overrides
                                  BOOKWYRM_API_KEY env var)
  -v, --verbose                   Show detailed citation information
  --long                          Show full citation text without truncation
  --timeout FLOAT                 Request timeout in seconds (default: no
                                  timeout)

classify

Classify files to determine their type and format.

This command analyzes files, URLs, or stdin content to determine their format type, content type, MIME type, and other classification details.

Classification Includes

  • Format type (text, image, binary, archive, etc.)
  • Content type (python_code, json_data, jpeg_image, etc.)
  • MIME type detection
  • Confidence score (0.0-1.0)
  • Additional details (encoding, language, etc.)

Examples

# Classify local file
bookwyrm classify --file document.pdf

# Classify from URL
bookwyrm classify --url https://example.com/file.dat

# Classify from stdin
echo "import pandas as pd" | bookwyrm classify --filename script.py

# With output file
bookwyrm classify --file unknown_file.bin --output classification.json

# With filename hint
bookwyrm classify --file data.txt --filename "research_data.csv" --output results.json

Output Format

JSON file containing classification results, file size, and sample preview

Usage:

 classify [OPTIONS]

Options:

  --url TEXT         URL to classify
  --file PATH        File to classify
  --filename TEXT    Optional filename hint for classification
  -o, --output PATH  Output file for classification results (JSON format)
  --base-url TEXT    Base URL of the BookWyrm API (overrides BOOKWYRM_API_URL
                     env var)
  --api-key TEXT     API key for authentication (overrides BOOKWYRM_API_KEY
                     env var)
  -v, --verbose      Show detailed information

extract-pdf

Extract structured data from PDF files using OCR.

This command extracts text elements from PDF files with position coordinates, confidence scores, and bounding box information. It supports both local files and remote URLs, with optional page range selection.

Features

  • OCR-based text extraction with confidence scores
  • Bounding box coordinates for each text element
  • Page range selection (--start-page with --num-pages)
  • Language support for OCR processing (default: English)
  • Streaming progress updates
  • Support for both local files and URLs

Advanced Processing Features

  • Layout Detection: --enable-layout-detection for better text structure
  • Table Recognition: --enable-table-recognition for table extraction
  • Formula Recognition: --enable-formula-recognition for math formulas
  • Seal Recognition: --enable-seal-recognition for stamps and seals
  • Chart Parsing: --enable-chart-parsing for graphs and charts
  • Document Preprocessing: --enable-document-preprocessing for better OCR
  • Model Selection: --use-lightweight-models (default) vs --use-full-models
  • Time Limits: --max-processing-time to set processing timeouts

Page Selection

  • --start-page: 1-based page number to begin extraction
  • --num-pages: Number of pages to process, counting from the start page
  • Omit both to process entire document
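
The page arithmetic can be sketched as follows: with a 1-based start page, the last page processed is `start_page + num_pages - 1`. The second function reproduces the auto-generated filename pattern visible in the examples in this section; that pattern is inferred from a single example and both helper names are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def page_range(start_page: int, num_pages: int) -> Tuple[int, int]:
    """Return the inclusive (first, last) page numbers for a 1-based selection."""
    if start_page < 1 or num_pages < 1:
        raise ValueError("start_page and num_pages must be at least 1")
    return start_page, start_page + num_pages - 1


def default_output_name(pdf_path: str, start_page: int, num_pages: int) -> str:
    """Mimic the auto-save filename pattern from the examples (assumed, not confirmed)."""
    first, last = page_range(start_page, num_pages)
    return f"{Path(pdf_path).stem}_pages_{first}-{last}_extracted.json"
```

For instance, starting at page 5 and processing 10 pages covers pages 5 through 14.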

Examples

# Extract entire PDF
bookwyrm extract-pdf document.pdf --output extracted.json

# Extract specific pages
bookwyrm extract-pdf large_doc.pdf --start-page 5 --num-pages 10 --output pages_5_14.json

# Extract from URL
bookwyrm extract-pdf --url https://example.com/document.pdf --output extracted.json

# Extract with specific language
bookwyrm extract-pdf document.pdf --lang fr --output extracted.json

# Advanced processing with layout detection
bookwyrm extract-pdf document.pdf --layout --output advanced_extracted.json

# Force OCR for better text quality (without layout detection)
bookwyrm extract-pdf document.pdf --force-ocr --output force_ocr_extracted.json

# Verbose output
bookwyrm extract-pdf document.pdf -v --output extracted.json

# Auto-save with generated filename (no --output needed)
bookwyrm extract-pdf my_document.pdf --start-page 5 --num-pages 3
# Saves to: my_document_pages_5-7_extracted.json

Output Format

JSON file containing pages array with text elements, coordinates, and metadata

Usage:

 extract-pdf [OPTIONS] [PDF_FILE]

Options:

  [PDF_FILE]            PDF file to extract from (optional if using --file or
                        --url)
  --url TEXT            PDF URL to extract from
  --file PATH           PDF file to extract from
  -o, --output PATH     Output file for extracted data (JSON format)
  --start-page INTEGER  1-based page number to start from
  --num-pages INTEGER   Number of pages to process from start_page
  --lang TEXT           Language code for OCR processing  [default: en]
  --layout              Enable advanced layout detection for better text
                        structure analysis
  --force-ocr           Force use of OCR endpoint even for native text PDFs
                        (auto-enabled with --layout)
  --timeout FLOAT       Request timeout in seconds (default: no timeout)
  --base-url TEXT       Base URL of the PDF extraction API (overrides
                        BOOKWYRM_API_URL env var)
  --api-key TEXT        API key for authentication (overrides BOOKWYRM_API_KEY
                        env var)
  -v, --verbose         Show detailed information

pdf-query-range

Query character ranges from PDF text mapping to get bounding boxes.

This command takes a character mapping JSON file (created by pdf-to-text) and returns the bounding boxes for a specified character range, grouped by page.

Examples

# Query characters 100-200 from mapping file
bookwyrm pdf-query-range data/heinrich_pages_1-4_mapping.json 100 200

# Save results to JSON file
bookwyrm pdf-query-range mapping.json 500 750 -o bounding_boxes.json

# Verbose output with detailed information
bookwyrm pdf-query-range mapping.json 0 100 -v

Output Format

Returns bounding boxes grouped by page number, with character positions and coordinates

Usage:

 pdf-query-range [OPTIONS] MAPPING_FILE START_CHAR END_CHAR

Options:

  MAPPING_FILE       Character mapping JSON file from pdf-to-text command
                     [required]
  START_CHAR         Starting character index (inclusive)  [required]
  END_CHAR           Ending character index (exclusive)  [required]
  -o, --output PATH  Output file for bounding box results (JSON format)
  -v, --verbose      Show detailed information
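
The range semantics (inclusive start, exclusive end, results grouped by page) can be illustrated with a small sketch. The mapping layout used here, a list with one `{"page": ..., "bbox": ...}` entry per character, is a hypothetical stand-in; the real mapping file's schema is not shown in this reference.

```python
from collections import defaultdict
from typing import Dict, List


def query_range(mapping: List[dict], start_char: int, end_char: int) -> Dict[int, List[dict]]:
    """Group mapping entries for the half-open range [start_char, end_char) by page."""
    if not 0 <= start_char <= end_char <= len(mapping):
        raise ValueError("character range is outside the mapping")
    by_page: Dict[int, List[dict]] = defaultdict(list)
    for position in range(start_char, end_char):
        entry = mapping[position]  # hypothetical per-character entry
        by_page[entry["page"]].append({"char": position, "bbox": entry["bbox"]})
    return dict(by_page)
```

Because the end index is exclusive, a query for 100 to 200 covers character positions 100 through 199.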

pdf-to-text

Convert PDF extraction JSON to raw text with character position mapping.

This command takes the JSON output from the extract-pdf command and converts it to:

  1. Raw text file with all text elements joined by newlines
  2. Character mapping JSON that maps each character position to its bounding box coordinates

The mapping accounts for inserted newlines by assigning them the bounding box of the preceding character.
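
The newline rule can be sketched as follows. This is not the command's implementation: the element layout (`text`, `page`, `bbox`) is a hypothetical simplification of the extract-pdf output, and each character simply inherits the box of the element it belongs to.

```python
from typing import List, Tuple


def build_text_and_mapping(elements: List[dict]) -> Tuple[str, List[dict]]:
    """Join element texts with newlines and record one mapping entry per character."""
    parts: List[str] = []
    mapping: List[dict] = []
    for element in elements:
        if mapping:
            # The inserted newline reuses the box of the preceding character.
            previous = mapping[-1]
            parts.append("\n")
            mapping.append({"page": previous["page"], "bbox": previous["bbox"]})
        text = str(element["text"])
        for _ in text:
            mapping.append({"page": element["page"], "bbox": element["bbox"]})
        parts.append(text)
    return "".join(parts), mapping
```

The raw text and the mapping always have the same length, so a character index into the text can be used directly as an index into the mapping.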

Examples

# Convert PDF extraction to raw text
bookwyrm pdf-to-text data/heinrich_pages_1-4.json

# Specify output files
bookwyrm pdf-to-text data/extracted.json -o raw_text.txt --mapping char_map.json

# Verbose output
bookwyrm pdf-to-text data/extracted.json -v

Output Files

  • Raw text file: All text elements joined with newlines
  • Mapping JSON: Character index to bounding box coordinate mapping

Usage:

 pdf-to-text [OPTIONS] JSON_FILE

Options:

  JSON_FILE          JSON file from extract-pdf command  [required]
  -o, --output PATH  Output file for raw text (default: input_name_raw.txt)
  --mapping PATH     Output file for character mapping JSON (default:
                     input_name_mapping.json)
  -v, --verbose      Show detailed information

phrasal

Stream text processing using phrasal analysis to extract phrases or chunks.

This command breaks down text into meaningful phrases or chunks using NLP with real-time streaming results. It supports processing from direct text input, files, or URLs.

Response Formats

  • with_offsets: Include character position information (start_char, end_char)
  • text_only: Return only the text content without position data

Response Format Control

  • Default: Include character position information (with_offsets)
  • --text-only: Return only text content without position data

Chunking

Use --chunk-size to create chunks of approximately the specified character count. Without --chunk-size, returns individual phrases.

Examples

# Process text directly
bookwyrm phrasal "Natural language processing is fascinating." -o phrases.jsonl

# Process file with position offsets (default behavior)
bookwyrm phrasal -f document.txt --output phrases.jsonl

# Create chunks of specific size (with position offsets by default)
bookwyrm phrasal -f large_text.txt --chunk-size 1000 --output chunks.jsonl

# Process from URL
bookwyrm phrasal --url https://example.com/text.txt --output phrases.jsonl

# Text only format using boolean flag
bookwyrm phrasal -f text.txt --text-only --output simple_phrases.jsonl

Output Format

JSONL file with one phrase/chunk per line:

{"type": "text_span", "text": "phrase text", "start_char": 0, "end_char": 11}

Or for text-only format:

{"type": "text", "text": "phrase text"}
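
A minimal sketch of reading either record shape from a phrasal output file. It relies only on the fields shown above; `read_phrases` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple

Span = Optional[Tuple[int, int]]


def read_phrases(lines: Iterable[str]) -> Iterator[Tuple[str, Span]]:
    """Yield (text, (start_char, end_char)); the span is None for text-only records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("type") == "text_span":
            yield record["text"], (record["start_char"], record["end_char"])
        else:
            yield record["text"], None
```

An open file handle can be passed directly, since iterating over it yields one line at a time.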

Usage:

 phrasal [OPTIONS] [INPUT_TEXT]

Options:

  [INPUT_TEXT]          Text to process (optional if using --url or --file)
  --url TEXT            URL to fetch text from
  -f, --file PATH       File to read text from
  -o, --output PATH     Output file for phrases (JSONL format)
  --chunk-size INTEGER  Target size for each chunk (if not specified, returns
                        phrases individually)
  --text-only           Return text only without position data
  --offsets             Return text with position offsets (default behavior)
  --base-url TEXT       Base URL of the BookWyrm API (overrides
                        BOOKWYRM_API_URL env var)
  --api-key TEXT        API key for authentication (overrides BOOKWYRM_API_KEY
                        env var)
  -v, --verbose         Show detailed information

summarize

Summarize text content from JSONL files.

This command performs hierarchical summarization of text phrases, with support for structured output using Pydantic models and custom prompts.

Input Format

The JSONL file should contain phrases in this format:

{"text": "phrase text", "start_char": 0, "end_char": 11}

Features

  • Structured Output: Use --model-class-file and --model-class-name to generate structured summaries that conform to your Pydantic model schema. The output file is required when using structured output.

  • Custom Prompts: Use --chunk-prompt and --summary-prompt together to customize the summarization process. Both prompts are required when using custom prompts.
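
As an illustration of structured output, a model file passed to --model-class-file might look like the sketch below. The class name `BookSummary` matches the example in this section, but the fields are hypothetical, and which Pydantic features the summarizer honors is not stated here.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class BookSummary(BaseModel):
    """Hypothetical schema for a structured book summary."""

    title: Optional[str] = Field(default=None, description="Title of the book, if stated")
    main_themes: List[str] = Field(default_factory=list, description="Key themes of the text")
    summary: str = Field(description="Short prose summary of the content")
```

Saved as `models/book_summary.py`, this file would be referenced with `--model-class-file models/book_summary.py --model-class-name BookSummary`, together with `--output` as required for structured output.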

Examples

# Basic summarization (uses swift model by default)
bookwyrm summarize book_phrases.jsonl --output summary.json

# High-quality summarization
bookwyrm summarize book_phrases.jsonl --model-strength wise --output summary.json

# Maximum sophistication
bookwyrm summarize complex_text.jsonl --model-strength brainiac --output summary.json

# With debug information
bookwyrm summarize data.jsonl --include-debug --output detailed_summary.json

# Larger chunks with advanced reasoning
bookwyrm summarize large_text.jsonl --max-tokens 20000 --model-strength clever --output summary.json

# Structured output with Pydantic model
bookwyrm summarize book.jsonl --model-class-file models/book_summary.py --model-class-name BookSummary --model-strength smart --output structured_summary.json

# Custom prompts with high-quality model
bookwyrm summarize scientific_text.jsonl --chunk-prompt "Extract key scientific concepts and findings" --summary-prompt "Create a comprehensive scientific overview" --model-strength wise --output science_summary.json

Model Strength Levels

  • swift: Fast processing for quick results
  • smart: Intelligent analysis with good quality
  • clever: Advanced reasoning capabilities
  • wise: High-quality analysis for important content
  • brainiac: Maximum sophistication for complex tasks

Output Format

JSON file containing summary, metadata, and optionally intermediate summaries

Usage:

 summarize [OPTIONS] JSONL_FILE

Options:

  JSONL_FILE                      JSONL file containing phrases  [required]
  -o, --output PATH               Output file for summary (JSON format)
  --max-tokens INTEGER            Maximum tokens per chunk (max: 131,072)
                                  [default: 10000]
  --model-strength [swift|smart|clever|wise|brainiac]
                                  Model strength level: swift (fast), smart
                                  (intelligent), clever (advanced), wise
                                  (high-quality), brainiac (maximum
                                  sophistication)  [default: swift]
  --include-debug                 Include intermediate summaries
  --model-class-file PATH         Python file containing Pydantic model class
                                  for structured output
  --model-class-name TEXT         Name of the Pydantic model class to use
                                  (required with --model-class-file)
  --chunk-prompt TEXT             Custom prompt for chunk summarization
                                  (requires --summary-prompt)
  --summary-prompt TEXT           Custom prompt for summary of summaries
                                  (requires --chunk-prompt)
  --base-url TEXT                 Base URL of the BookWyrm API (overrides
                                  BOOKWYRM_API_URL env var)
  --api-key TEXT                  API key for authentication (overrides
                                  BOOKWYRM_API_KEY env var)
  -v, --verbose                   Show detailed information

Additional Information

Environment Variables

  • BOOKWYRM_API_KEY - Your BookWyrm API key (required)
  • BOOKWYRM_API_URL - Base URL for the BookWyrm API (default: https://api.bookwyrm.ai:443)
  • BOOKWYRM_PDF_API_URL - Base URL for PDF extraction API (falls back to BOOKWYRM_API_URL)

Global Options

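Commands that call the BookWyrm API (cite, classify, extract-pdf, phrasal, and summarize) support these options:

  • --base-url TEXT - Override the default API base URL
  • --api-key TEXT - Provide API key (overrides BOOKWYRM_API_KEY env var)

The following are also available:

  • --version - Show version and exit (top-level bookwyrm command)
  • --help - Show help message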

Error Handling

The CLI prints an error message on failure and reports the result through its exit code:

  • Exit code 0: Success
  • Exit code 1: Error (API error, file not found, invalid arguments, etc.)
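
In a script, the exit code is the reliable signal to check. The sketch below wraps a command with `subprocess.run`; `run_cli` is a hypothetical helper, and which stream carries the error message is an assumption, so both are returned.

```python
import subprocess
from typing import List, Tuple


def run_cli(command: List[str]) -> Tuple[bool, str]:
    """Run a command and report success based on its exit code (0 means success)."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Return both streams, since the error text may appear on either one.
    return result.returncode == 0, result.stdout + result.stderr
```

A call such as `run_cli(["bookwyrm", "classify", "--file", "document.pdf"])` returns `False` as the first element whenever the CLI exits with a non-zero code.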

Output Formats

Citations Output

JSON format (non-streaming):

[
  {
    "start_chunk": 0,
    "end_chunk": 0,
    "text": "Citation text here",
    "reasoning": "Why this citation is relevant",
    "quality": 3
  }
]

JSONL format (streaming):

{"start_chunk": 0, "end_chunk": 0, "text": "Citation 1", "reasoning": "...", "quality": 3}
{"start_chunk": 1, "end_chunk": 1, "text": "Citation 2", "reasoning": "...", "quality": 4}
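
Both shapes can be loaded with the same helper, as in the sketch below. The function names are hypothetical, and reading a higher `quality` value as a better citation is an assumption that this reference does not confirm.

```python
import json
from typing import List


def load_citations(text: str) -> List[dict]:
    """Parse citations from either a JSON array or JSONL text."""
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        # Non-streaming output: a single JSON array.
        return json.loads(stripped)
    # Streaming output: one JSON object per line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def filter_by_quality(citations: List[dict], minimum_quality: int) -> List[dict]:
    """Keep citations at or above a quality threshold (higher assumed better)."""
    return [c for c in citations if c.get("quality", 0) >= minimum_quality]
```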

Summary Output

{
  "summary": "The generated summary text or structured JSON",
  "subsummary_count": 5,
  "levels_used": 2,
  "total_tokens": 15000,
  "source_file": "input.jsonl",
  "max_tokens": 10000,
  "model_used": "BookSummary",
  "intermediate_summaries": [["level 1 summaries"], ["level 2 summaries"]]
}

Phrases Output (JSONL)

{"type": "text_span", "text": "First phrase", "start_char": 0, "end_char": 12}
{"type": "text_span", "text": "Second phrase", "start_char": 13, "end_char": 26}