CLI Reference¶
The BookWyrm client includes a comprehensive command-line interface for all text processing operations. The documentation below is automatically generated from the CLI code.
bookwyrm¶
BookWyrm Client CLI - Accelerate RAG and AI agent development.
The BookWyrm client provides powerful text processing capabilities through a simple CLI, making it easy to build sophisticated document analysis and citation systems.
Key Capabilities¶
- Citation Finding - Find relevant citations for questions in text chunks
- Text Summarization - Generate summaries with custom Pydantic models
- Phrasal Analysis - Extract phrases and chunks from text using NLP
- PDF Extraction - Extract structured text data from PDFs with OCR
- File Classification - Intelligently classify files by format and type
- Streaming Support - Real-time progress updates for all operations
Environment Variables¶
- BOOKWYRM_API_KEY - Your BookWyrm API key (required)
- BOOKWYRM_API_URL - Base URL (default: https://api.bookwyrm.ai:443)
- BOOKWYRM_PDF_API_URL - PDF API URL (falls back to BOOKWYRM_API_URL)
Get your API key at: https://api.bookwyrm.ai
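The fallback order of these variables can be sketched in a few lines of Python. This is an illustration only, not the client's own configuration code: the function name `resolve_settings` and the returned dictionary keys are hypothetical, and only the variable names and the default URL come from the list above.

```python
import os

# Default base URL as documented in the environment variable list.
DEFAULT_API_URL = "https://api.bookwyrm.ai:443"


def resolve_settings(env=None):
    """Return the API key and base URLs implied by the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("BOOKWYRM_API_KEY")
    if not api_key:
        raise RuntimeError("BOOKWYRM_API_KEY is required")
    api_url = env.get("BOOKWYRM_API_URL", DEFAULT_API_URL)
    # The PDF URL falls back to the general API URL when it is not set.
    pdf_api_url = env.get("BOOKWYRM_PDF_API_URL", api_url)
    return {"api_key": api_key, "api_url": api_url, "pdf_api_url": pdf_api_url}
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given shell setup would resolve to.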
Usage:
Options:
--version Show version and exit
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to copy it or
customize the installation.
cite¶
Find citations for questions in text chunks.
This command searches through text chunks to find relevant citations that answer questions. It supports both local JSONL files and remote URLs, and can handle single or multiple questions.
Input Format¶
The JSONL file should contain text chunks in this format:
Question Input Methods¶
- Single question: Use --question "Your question here"
- Multiple questions: Use --question multiple times: --question "Q1" --question "Q2"
- Questions file: Use --questions-file questions.txt with one question per line
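When the questions come from a program rather than a shell, the argument list can be assembled in Python. The sketch below is a hypothetical helper (`build_cite_args` is not part of the BookWyrm client); it uses only the `--question` and `--questions-file` options described above and returns an argument list suitable for `subprocess.run`.

```python
from pathlib import Path


def build_cite_args(chunks_path, questions, questions_file=None):
    """Build an argv list for `bookwyrm cite` (hypothetical helper)."""
    args = ["bookwyrm", "cite"]
    if questions_file is not None:
        # Write one question per line, the layout --questions-file expects.
        Path(questions_file).write_text("\n".join(questions) + "\n", encoding="utf-8")
        args += ["--questions-file", str(questions_file)]
    else:
        # Repeat --question once for every question.
        for question in questions:
            args += ["--question", question]
    args.append(str(chunks_path))
    return args
```

For a handful of questions, repeating `--question` keeps everything on one command line; a questions file is easier to manage once the list grows.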
Examples¶
# Single question (uses swift model by default)
bookwyrm cite --question "What is machine learning?" ml_chunks.jsonl
# High-quality citation finding
bookwyrm cite --question "What is AI?" data.jsonl --model-strength wise
# Multiple questions with advanced reasoning
bookwyrm cite --question "What is AI?" --question "How does ML work?" data.jsonl --model-strength clever
# Questions from file with maximum sophistication
bookwyrm cite --questions-file questions.txt data.jsonl --model-strength brainiac -o citations.json
# From URL with multiple questions
bookwyrm cite --question "Q1" --question "Q2" --url https://example.com/chunks.jsonl
# Limit processing with smart model
bookwyrm cite --question "Question" data.jsonl --start 10 --limit 50 --model-strength smart
Model Strength Levels¶
- swift: Fast processing for quick results
- smart: Intelligent analysis with good quality
- clever: Advanced reasoning capabilities
- wise: High-quality analysis for important content
- brainiac: Maximum sophistication for complex questions
cite [OPTIONS] [JSONL_INPUT]
[JSONL_INPUT] Path to JSONL file containing text chunks (optional if using --file or --url)
-q, --question TEXT Question to find citations for (can be used multiple times)
--questions-file PATH File containing questions, one per line
--url TEXT URL to JSONL file (alternative to file path)
--file PATH JSONL file to read chunks from
-o, --output PATH Output file for citations (JSON for non-streaming, JSONL for streaming)
--start INTEGER Start chunk index (default: 0) [default: 0]
--limit INTEGER Limit number of chunks to process
--max-tokens INTEGER Maximum tokens per chunk (default: 1000) [default: 1000]
--model-strength [swift|smart|clever|wise|brainiac] Model strength level: swift (fast), smart (intelligent), clever (advanced), wise (high-quality), brainiac (maximum sophistication) [default: swift]
--base-url TEXT Base URL of the BookWyrm API (overrides BOOKWYRM_API_URL env var)
--api-key TEXT API key for authentication (overrides BOOKWYRM_API_KEY env var)
-v, --verbose Show detailed citation information
--long Show full citation text without truncation
--timeout FLOAT Request timeout in seconds (default: no timeout)
Output Formats¶
- JSON (non-streaming): Array of citation objects
- JSONL (streaming): One citation per line as they're found
- For multiple questions, citations include question index and text
Usage:
classify¶
Classify files to determine their type and format.
This command analyzes files, URLs, or stdin content to determine their format type, content type, MIME type, and other classification details.
Classification Includes¶
- Format type (text, image, binary, archive, etc.)
- Content type (python_code, json_data, jpeg_image, etc.)
- MIME type detection
- Confidence score (0.0-1.0)
- Additional details (encoding, language, etc.)
Examples¶
# Classify local file
bookwyrm classify --file document.pdf
# Classify from URL
bookwyrm classify --url https://example.com/file.dat
# Classify from stdin
echo "import pandas as pd" | bookwyrm classify --filename script.py
# With output file
bookwyrm classify --file unknown_file.bin --output classification.json
# With filename hint
bookwyrm classify --file data.txt --filename "research_data.csv" --output results.json
Output Format¶
JSON file containing classification results, file size, and sample preview
Usage:
Options:
--url TEXT URL to classify
--file PATH File to classify
--filename TEXT Optional filename hint for classification
-o, --output PATH Output file for classification results (JSON format)
--base-url TEXT Base URL of the BookWyrm API (overrides BOOKWYRM_API_URL
env var)
--api-key TEXT API key for authentication (overrides BOOKWYRM_API_KEY
env var)
-v, --verbose Show detailed information
extract-pdf¶
Extract structured data from PDF files using OCR.
This command extracts text elements from PDF files with position coordinates, confidence scores, and bounding box information. It supports both local files and remote URLs, with optional page range selection.
Features¶
- OCR-based text extraction with confidence scores
- Bounding box coordinates for each text element
- Page range selection (start_page + num_pages)
- Language support for OCR processing (default: English)
- Streaming progress updates
- Support for both local files and URLs
Advanced Processing Features¶
- Layout Detection: --enable-layout-detection for better text structure
- Table Recognition: --enable-table-recognition for table extraction
- Formula Recognition: --enable-formula-recognition for math formulas
- Seal Recognition: --enable-seal-recognition for stamps and seals
- Chart Parsing: --enable-chart-parsing for graphs and charts
- Document Preprocessing: --enable-document-preprocessing for better OCR
- Model Selection: --use-lightweight-models (default) vs --use-full-models
- Time Limits: --max-processing-time to set processing timeouts
Page Selection¶
- start_page: 1-based page number to begin extraction
- num_pages: Number of pages to process from start_page
- Omit both to process entire document
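Because start_page is 1-based and num_pages is a count, the last page processed is start_page + num_pages - 1. The small Python helper below works through that arithmetic; the name `page_range` is hypothetical and the function is not part of the CLI.

```python
def page_range(start_page, num_pages):
    """Return the inclusive (first, last) 1-based pages covered by a selection."""
    if start_page < 1 or num_pages < 1:
        raise ValueError("start_page and num_pages must be >= 1")
    # The count includes the start page itself, hence the "- 1".
    return start_page, start_page + num_pages - 1
```

With `--start-page 5 --num-pages 10` this gives pages 5 through 14, and with `--start-page 5 --num-pages 3` it gives pages 5 through 7.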
Examples¶
# Extract entire PDF
bookwyrm extract-pdf document.pdf --output extracted.json
# Extract specific pages
bookwyrm extract-pdf large_doc.pdf --start-page 5 --num-pages 10 --output pages_5_14.json
# Extract from URL
bookwyrm extract-pdf --url https://example.com/document.pdf --output extracted.json
# Extract with specific language
bookwyrm extract-pdf document.pdf --lang fr --output extracted.json
# Advanced processing with layout detection
bookwyrm extract-pdf document.pdf --layout --output advanced_extracted.json
# Force OCR for better text quality (without layout detection)
bookwyrm extract-pdf document.pdf --force-ocr --output force_ocr_extracted.json
# Verbose output
bookwyrm extract-pdf document.pdf -v --output extracted.json
# Auto-save with generated filename (no --output needed)
bookwyrm extract-pdf my_document.pdf --start-page 5 --num-pages 3
# Saves to: my_document_pages_5-7_extracted.json
Output Format¶
JSON file containing pages array with text elements, coordinates, and metadata
Usage:
Options:
[PDF_FILE] PDF file to extract from (optional if using --file or
--url)
--url TEXT PDF URL to extract from
--file PATH PDF file to extract from
-o, --output PATH Output file for extracted data (JSON format)
--start-page INTEGER 1-based page number to start from
--num-pages INTEGER Number of pages to process from start_page
--lang TEXT Language code for OCR processing (default: en)
[default: en]
--layout Enable advanced layout detection for better text
structure analysis
--force-ocr Force use of OCR endpoint even for native text PDFs
(auto-enabled with --layout)
--timeout FLOAT Request timeout in seconds (default: no timeout)
--base-url TEXT Base URL of the PDF extraction API (overrides
BOOKWYRM_API_URL env var)
--api-key TEXT API key for authentication (overrides BOOKWYRM_API_KEY
env var)
-v, --verbose Show detailed information
pdf-query-range¶
Query character ranges from PDF text mapping to get bounding boxes.
This command takes a character mapping JSON file (created by pdf-to-text) and returns the bounding boxes for a specified character range, grouped by page.
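The lookup can be pictured with a short Python sketch. The real schema of the mapping file is not shown in this reference, so the structure used here is an assumption: a list in which entry i holds a hypothetical `page` number and `bbox` for character i. The half-open range follows the START_CHAR (inclusive) and END_CHAR (exclusive) convention of this command.

```python
from collections import defaultdict


def query_range(mapping, start_char, end_char):
    """Group bounding boxes for characters [start_char, end_char) by page.

    `mapping` is an assumed structure: mapping[i] == {"page": int, "bbox": list}.
    """
    if not 0 <= start_char <= end_char <= len(mapping):
        raise ValueError("character range is outside the mapping")
    boxes_by_page = defaultdict(list)
    for index in range(start_char, end_char):
        entry = mapping[index]
        boxes = boxes_by_page[entry["page"]]
        # Skip consecutive repeats so a run of characters sharing a box yields it once.
        if not boxes or boxes[-1] != entry["bbox"]:
            boxes.append(entry["bbox"])
    return dict(boxes_by_page)
```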
Examples¶
# Query characters 100-200 from mapping file
bookwyrm pdf-query-range data/heinrich_pages_1-4_mapping.json 100 200
# Save results to JSON file
bookwyrm pdf-query-range mapping.json 500 750 -o bounding_boxes.json
# Verbose output with detailed information
bookwyrm pdf-query-range mapping.json 0 100 -v
Output Format¶
Returns bounding boxes grouped by page number, with character positions and coordinates
Usage:
Options:
MAPPING_FILE Character mapping JSON file from pdf-to-text command
[required]
START_CHAR Starting character index (inclusive) [required]
END_CHAR Ending character index (exclusive) [required]
-o, --output PATH Output file for bounding box results (JSON format)
-v, --verbose Show detailed information
pdf-to-text¶
Convert PDF extraction JSON to raw text with character position mapping.
This command takes the JSON output from the extract-pdf command and converts it to:
1. Raw text file with all text elements joined by newlines
2. Character mapping JSON that maps each character position to its bounding box coordinates
The mapping accounts for inserted newlines by assigning them the bounding box of the preceding character.
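The newline rule can be illustrated with a minimal Python sketch. It is not the command's implementation: the input structure (a list of elements with hypothetical `text`, `page`, and `bbox` keys) is an assumption, and for simplicity every character of an element receives that element's box.

```python
def to_text_and_mapping(elements):
    """Join element texts with newlines and map every character to a box.

    `elements` is an assumed structure: [{"text": str, "page": int, "bbox": list}, ...].
    """
    text_parts = []
    mapping = []
    for element in elements:
        if not element["text"]:
            continue  # nothing to map for an empty element
        if mapping:
            # Inserted newline: reuse the box of the preceding character.
            text_parts.append("\n")
            mapping.append(dict(mapping[-1]))
        box = {"page": element["page"], "bbox": element["bbox"]}
        text_parts.append(element["text"])
        # One mapping entry per character of the element.
        mapping.extend(dict(box) for _ in element["text"])
    return "".join(text_parts), mapping
```

The result always has exactly one mapping entry per character of the joined text, which is what makes character ranges usable as lookup keys.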
Examples¶
# Convert PDF extraction to raw text
bookwyrm pdf-to-text data/heinrich_pages_1-4.json
# Specify output files
bookwyrm pdf-to-text data/extracted.json -o raw_text.txt --mapping char_map.json
# Verbose output
bookwyrm pdf-to-text data/extracted.json -v
Output Files¶
- Raw text file: All text elements joined with newlines
- Mapping JSON: Character index to bounding box coordinate mapping
Usage:
Options:
JSON_FILE JSON file from extract-pdf command [required]
-o, --output PATH Output file for raw text (default: input_name_raw.txt)
--mapping PATH Output file for character mapping JSON (default:
input_name_mapping.json)
-v, --verbose Show detailed information
phrasal¶
Stream text processing using phrasal analysis to extract phrases or chunks.
This command breaks down text into meaningful phrases or chunks using NLP with real-time streaming results. It supports processing from direct text input, files, or URLs.
Response Formats¶
- with_offsets: Include character position information (start_char, end_char)
- text_only: Return only the text content without position data
Response Format Control¶
- Default: Include character position information (with_offsets)
- --text-only: Return only text content without position data
Chunking¶
Use --chunk-size to create chunks of approximately the specified character count.
Without --chunk-size, returns individual phrases.
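To give a feel for what "approximately the specified character count" can mean, here is one possible greedy packing strategy in Python. The service's actual chunking algorithm is not documented in this reference, so this is an assumption for illustration only, and `pack_phrases` is a hypothetical name.

```python
def pack_phrases(phrases, chunk_size):
    """Greedily group consecutive phrases into chunks near chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    chunks = []
    current = ""
    for phrase in phrases:
        # Close the current chunk when adding this phrase would exceed the target.
        if current and len(current) + len(phrase) > chunk_size:
            chunks.append(current)
            current = ""
        current += phrase
    if current:
        chunks.append(current)
    return chunks
```

In this sketch phrases are never split, so a single phrase longer than the target becomes a chunk of its own, which is one reason the size is only approximate.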
Examples¶
# Process text directly
bookwyrm phrasal "Natural language processing is fascinating." -o phrases.jsonl
# Process file with position offsets (default behavior)
bookwyrm phrasal -f document.txt --output phrases.jsonl
# Create chunks of specific size (with position offsets by default)
bookwyrm phrasal -f large_text.txt --chunk-size 1000 --output chunks.jsonl
# Process from URL
bookwyrm phrasal --url https://example.com/text.txt --output phrases.jsonl
# Text only format using boolean flag
bookwyrm phrasal -f text.txt --text-only --output simple_phrases.jsonl
Output Format¶
JSONL file with one phrase/chunk per line:
Or for text-only format:
Usage:
Options:
[INPUT_TEXT] Text to process (optional if using --url or --file)
--url TEXT URL to fetch text from
-f, --file PATH File to read text from
-o, --output PATH Output file for phrases (JSONL format)
--chunk-size INTEGER Target size for each chunk (if not specified, returns
phrases individually)
--text-only Return text only without position data
--offsets Return text with position offsets (default behavior)
--base-url TEXT Base URL of the BookWyrm API (overrides
BOOKWYRM_API_URL env var)
--api-key TEXT API key for authentication (overrides BOOKWYRM_API_KEY
env var)
-v, --verbose Show detailed information
summarize¶
Summarize text content from JSONL files.
This command performs hierarchical summarization of text phrases, with support for structured output using Pydantic models and custom prompts.
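Hierarchical summarization generally means summarizing groups of text that fit a token budget, then summarizing those summaries until a single one remains. The Python sketch below shows that general idea under stated assumptions; it is not the service's algorithm. The `summarize` and `count_tokens` callables are stand-ins supplied by the caller, and all function names are hypothetical.

```python
def group_by_budget(texts, max_tokens, count_tokens):
    """Split texts into consecutive groups whose token total stays within max_tokens."""
    groups, current, used = [], [], 0
    for text in texts:
        cost = count_tokens(text)
        if current and used + cost > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        groups.append(current)
    return groups


def hierarchical_summarize(texts, max_tokens, summarize, count_tokens):
    """Summarize groups level by level; return (final_summary, intermediate_levels)."""
    if not texts:
        raise ValueError("nothing to summarize")
    levels = []
    while True:
        groups = group_by_budget(texts, max_tokens, count_tokens)
        # Stop if a level merges nothing, so the loop always terminates.
        if len(groups) > 1 and len(groups) == len(texts):
            raise ValueError("max_tokens is too small to merge any texts")
        summaries = [summarize(group) for group in groups]
        if len(summaries) == 1:
            return summaries[0], levels
        levels.append(summaries)
        texts = summaries
```

A larger `max_tokens` puts more text into each group, which tends to reduce the number of levels needed.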
Input Format¶
The JSONL file should contain phrases in this format:
Features¶
- Structured Output: Use --model-class-file and --model-class-name to generate structured summaries that conform to your Pydantic model schema. The output file is required when using structured output.
- Custom Prompts: Use --chunk-prompt and --summary-prompt together to customize the summarization process. Both prompts are required when using custom prompts.
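The pairing rules above can be checked before invoking the command. The following Python sketch is a hypothetical pre-flight check, not the CLI's own validation; it encodes only the constraints stated in this section, and treating --model-class-name without --model-class-file as an error is an assumption.

```python
def check_summarize_options(output=None, model_class_file=None, model_class_name=None,
                            chunk_prompt=None, summary_prompt=None):
    """Return a list of problems with a planned `bookwyrm summarize` call."""
    problems = []
    # Structured output needs both the model file and the class name.
    if bool(model_class_file) != bool(model_class_name):
        problems.append("--model-class-file and --model-class-name must be used together")
    # Structured output also needs an output file.
    if model_class_file and not output:
        problems.append("--output is required for structured output")
    # Custom prompts only work as a pair.
    if bool(chunk_prompt) != bool(summary_prompt):
        problems.append("--chunk-prompt and --summary-prompt must be used together")
    return problems
```

An empty list means the combination is consistent with the rules described here; the CLI may still reject a call for other reasons.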
Examples¶
# Basic summarization (uses swift model by default)
bookwyrm summarize book_phrases.jsonl --output summary.json
# High-quality summarization
bookwyrm summarize book_phrases.jsonl --model-strength wise --output summary.json
# Maximum sophistication
bookwyrm summarize complex_text.jsonl --model-strength brainiac --output summary.json
# With debug information
bookwyrm summarize data.jsonl --include-debug --output detailed_summary.json
# Larger chunks with advanced reasoning
bookwyrm summarize large_text.jsonl --max-tokens 20000 --model-strength clever --output summary.json
# Structured output with Pydantic model
bookwyrm summarize book.jsonl --model-class-file models/book_summary.py --model-class-name BookSummary --model-strength smart --output structured_summary.json
# Custom prompts with high-quality model
bookwyrm summarize scientific_text.jsonl --chunk-prompt "Extract key scientific concepts and findings" --summary-prompt "Create a comprehensive scientific overview" --model-strength wise --output science_summary.json
Model Strength Levels¶
- swift: Fast processing for quick results
- smart: Intelligent analysis with good quality
- clever: Advanced reasoning capabilities
- wise: High-quality analysis for important content
- brainiac: Maximum sophistication for complex tasks
Output Format¶
JSON file containing summary, metadata, and optionally intermediate summaries
Usage:
Options:
JSONL_FILE JSONL file containing phrases [required]
-o, --output PATH Output file for summary (JSON format)
--max-tokens INTEGER Maximum tokens per chunk (max: 131,072,
default: 10000) [default: 10000]
--model-strength [swift|smart|clever|wise|brainiac]
Model strength level: swift (fast), smart
(intelligent), clever (advanced), wise
(high-quality), brainiac (maximum
sophistication) [default: swift]
--include-debug Include intermediate summaries
--model-class-file PATH Python file containing Pydantic model class
for structured output
--model-class-name TEXT Name of the Pydantic model class to use
(required with --model-class-file)
--chunk-prompt TEXT Custom prompt for chunk summarization
(requires --summary-prompt)
--summary-prompt TEXT Custom prompt for summary of summaries
(requires --chunk-prompt)
--base-url TEXT Base URL of the BookWyrm API (overrides
BOOKWYRM_API_URL env var)
--api-key TEXT API key for authentication (overrides
BOOKWYRM_API_KEY env var)
-v, --verbose Show detailed information
Additional Information¶
Environment Variables¶
- BOOKWYRM_API_KEY - Your BookWyrm API key (required)
- BOOKWYRM_API_URL - Base URL for the BookWyrm API (default: https://api.bookwyrm.ai:443)
- BOOKWYRM_PDF_API_URL - Base URL for PDF extraction API (falls back to BOOKWYRM_API_URL)
Global Options¶
All commands support these global options:
- --base-url TEXT - Override the default API base URL
- --api-key TEXT - Provide API key (overrides BOOKWYRM_API_KEY env var)
- --version - Show version and exit
- --help - Show help message
Error Handling¶
The CLI provides helpful error messages and exit codes:
- Exit code 0: Success
- Exit code 1: Error (API error, file not found, invalid arguments, etc.)
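In scripts, these exit codes can be checked with Python's standard `subprocess` module. The wrapper below is a minimal sketch; `run_cli` is a hypothetical name, and it works for any command given as an argument list.

```python
import subprocess


def run_cli(argv):
    """Run a command and return (succeeded, stdout, stderr)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    # Exit code 0 means success; any other code is treated as an error.
    return completed.returncode == 0, completed.stdout, completed.stderr
```

A call such as `run_cli(["bookwyrm", "classify", "--file", "document.pdf"])` would then return `False` as its first element if the command reported an error.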
Output Formats¶
Citations Output¶
JSON format (non-streaming):
[
  {
    "start_chunk": 0,
    "end_chunk": 0,
    "text": "Citation text here",
    "reasoning": "Why this citation is relevant",
    "quality": 3
  }
]
JSONL format (streaming):
{"start_chunk": 0, "end_chunk": 0, "text": "Citation 1", "reasoning": "...", "quality": 3}
{"start_chunk": 1, "end_chunk": 1, "text": "Citation 2", "reasoning": "...", "quality": 4}
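Both layouts can be read with Python's standard `json` module. The sketch below is a hypothetical reader, not part of the client; it relies only on the fields shown in the examples above. The `min_quality` filter assumes that a higher `quality` value means a better citation, which this reference does not state.

```python
import json


def load_citations(text):
    """Parse citations from either a JSON array or JSONL text."""
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        # Non-streaming output: a single JSON array.
        return json.loads(stripped)
    # Streaming output: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def filter_by_quality(citations, min_quality):
    """Keep citations whose quality is at least min_quality (assumes higher is better)."""
    return [c for c in citations if c["quality"] >= min_quality]
```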
Summary Output¶
{
  "summary": "The generated summary text or structured JSON",
  "subsummary_count": 5,
  "levels_used": 2,
  "total_tokens": 15000,
  "source_file": "input.jsonl",
  "max_tokens": 10000,
  "model_used": "BookSummary",
  "intermediate_summaries": [["level 1 summaries"], ["level 2 summaries"]]
}