BookWyrm CLI Tutorial

This tutorial demonstrates the key capabilities of the BookWyrm CLI through practical examples using sample data files.

1. File Classification

First, let's classify a PDF file to understand its content type and structure:

# Classify the State-of-the-Art spacecraft technology (SOA) PDF to understand its content
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/SOA_2025_Final.pdf
bookwyrm classify --file data/SOA_2025_Final.pdf

This will analyze the PDF and return classification information including file type, content analysis, and structural metadata.

2. PDF Structure Extraction

Next, let's extract structured data from specific pages of the PDF:

# Extract structured JSON data from pages 1-4 of the SOA PDF
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/SOA_2025_Final.pdf
bookwyrm extract-pdf data/SOA_2025_Final.pdf --start-page 1 --num-pages 4 --output data/SOA_2025_Final_1-4.json

This creates a JSON file containing the structured text, bounding boxes, and layout information for the specified pages.

3. PDF to Text Conversion with Character Mapping

Convert the extracted PDF data to raw text with character position mapping:

# Convert PDF extraction to raw text with character mapping
bookwyrm pdf-to-text data/SOA_2025_Final_1-4.json --verbose

This creates two files:

  • data/SOA_2025_Final_1-4_raw.txt - Raw text with all PDF text elements joined by newlines
  • data/SOA_2025_Final_1-4_mapping.json - Character mapping that links each character position to its bounding box coordinates and page number
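
To make the idea of a character mapping concrete, here is a minimal Python sketch of how such a mapping could be built when text elements are joined by newlines. The element keys (`text`, `page`, `bbox`), the one-record-per-element granularity, and the function name `build_mapping` are assumptions for illustration; the files BookWyrm actually writes may be organized differently.

```python
from typing import Dict, List, Tuple

# Hypothetical element shape: the real extraction JSON may use other keys.
Element = Dict[str, object]


def build_mapping(elements: List[Element]) -> Tuple[str, List[dict]]:
    """Join element texts with newlines and record each element's character span."""
    parts: List[str] = []
    spans: List[dict] = []
    offset = 0
    for element in elements:
        text = str(element["text"])
        spans.append({
            "start": offset,            # inclusive character index
            "end": offset + len(text),  # exclusive character index
            "page": element["page"],
            "bbox": element["bbox"],
        })
        parts.append(text)
        offset += len(text) + 1  # +1 accounts for the newline separator
    return "\n".join(parts), spans


# Made-up sample elements, not taken from the SOA PDF.
elements = [
    {"text": "State of the Art", "page": 1, "bbox": [72, 90, 300, 110]},
    {"text": "Small Spacecraft", "page": 1, "bbox": [72, 120, 310, 140]},
]
raw_text, spans = build_mapping(elements)
print(raw_text[spans[1]["start"]:spans[1]["end"]])
```

Slicing the raw text with a recorded span returns the original element text, which is the property that lets a position in the text file be traced back to a page and bounding box.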

You can also specify custom output filenames:

# Convert with custom output filenames
bookwyrm pdf-to-text data/SOA_2025_Final_1-4.json \
  --output data/SOA_2025_Final_1-4_text.txt \
  --mapping data/SOA_2025_Final_1-4_char_map.json \
  --verbose

4. Querying Character Positions

Query specific character ranges to get their bounding box coordinates:

# Query characters 974-1089 to see their positions and bounding boxes
bookwyrm pdf-query-range data/SOA_2025_Final_1-4_mapping.json 974 1089 --verbose

This shows you:

  • Which pages contain the specified character range
  • Bounding box coordinates for each character
  • OCR confidence scores
  • Sample text from the range

Save the query results to a file:

# Save bounding box query results to JSON
bookwyrm pdf-query-range data/SOA_2025_Final_1-4_mapping.json 974 1089 \
  --output data/character_positions.json \
  --verbose
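
Conceptually, a range query selects every mapped span that overlaps the requested characters and reports where those spans sit on each page. The sketch below shows one way to do that in Python. It assumes hypothetical span records with `start`, `end`, `page`, and `bbox` as `[x0, y0, x1, y1]`, and treats the range as half-open; whether the CLI treats the end index as inclusive is not stated here.

```python
from typing import Dict, List


def query_range(spans: List[dict], start: int, end: int) -> Dict[int, List[float]]:
    """Return one bounding box per page covering every span that overlaps [start, end)."""
    boxes: Dict[int, List[float]] = {}
    for span in spans:
        if span["end"] <= start or span["start"] >= end:
            continue  # span lies entirely outside the requested range
        x0, y0, x1, y1 = span["bbox"]
        page = span["page"]
        if page not in boxes:
            boxes[page] = [x0, y0, x1, y1]
        else:
            box = boxes[page]
            # Grow the page's box so it also encloses this span.
            boxes[page] = [min(box[0], x0), min(box[1], y0),
                           max(box[2], x1), max(box[3], y1)]
    return boxes


# Made-up span records in the assumed format.
spans = [
    {"start": 0, "end": 16, "page": 1, "bbox": [72, 90, 300, 110]},
    {"start": 17, "end": 33, "page": 1, "bbox": [72, 120, 310, 140]},
    {"start": 34, "end": 50, "page": 2, "bbox": [72, 90, 200, 110]},
]
print(query_range(spans, 10, 40))
```

Merging boxes per page gives a single rectangle suitable for a coarse highlight; keeping the individual boxes instead would give tighter, line-level highlights.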

5. Phrasal Text Processing

Now let's process a text file to extract meaningful phrases and text spans:

# Create phrasal analysis of "The Country of the Blind" text
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/country-of-the-blind.txt
bookwyrm phrasal --file data/country-of-the-blind.txt --output data/country-of-the-blind-phrases.jsonl

This generates a JSONL file with text chunks and their positional information, suitable for further analysis.
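
A JSONL file holds one JSON object per line, so it can be read incrementally with the Python standard library. The sketch below is a generic reader; the record keys in the sample (`text`, `start_char`, `end_char`) are hypothetical, since the exact fields BookWyrm writes are not shown in this tutorial.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Hypothetical records standing in for the contents of a phrases file.
sample = [
    '{"text": "Three hundred miles and more", "start_char": 0, "end_char": 28}',
    "",
    '{"text": "from Chimborazo", "start_char": 29, "end_char": 44}',
]
records = list(read_jsonl(sample))
print(len(records), records[0]["text"])
```

To read a real file, pass an open file object instead of the list, for example `read_jsonl(open(path, encoding="utf-8"))`.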

6. Text Summarization

Let's create summaries from the phrasal data we just generated. BookWyrm supports both basic summarization and structured output using Pydantic models.

Basic Summarization

# Generate a basic summary from the Country of the Blind phrases
bookwyrm summarize data/country-of-the-blind-phrases.jsonl --output data/country-of-the-blind-summary.json --verbose

Structured Literary Analysis with Pydantic Models

BookWyrm supports structured output using custom Pydantic models, allowing you to extract specific information in a consistent format. This is particularly powerful for literary analysis, research, and data extraction tasks.

The Summary Model

The Summary model in data/summary.py demonstrates how to create a structured analysis model for literary works:

from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import date

class Summary(BaseModel):
    """Structured summary model for literary works."""

    title: Optional[str] = Field(
        None,
        description="The title of the literary work. Extract the exact title as it appears in the text, or infer it if clearly referenced.",
    )

    author: Optional[str] = Field(
        None,
        description="The author or authors of the literary work. Include full names when available, or partial names if that's all that's provided.",
    )

    date_of_publication: Optional[date] = Field(
        None,
        description="The publication date of the work in YYYY-MM-DD format. Use the earliest known publication date. If only a year is known, use January 1st of that year.",
    )

    plot: Optional[str] = Field(
        None,
        description="A comprehensive summary of the main plot, storyline, or narrative arc. Include key events, conflicts, and resolutions. For non-fiction, describe the main arguments or themes presented.",
    )

    timeline: Optional[str] = Field(
        None,
        description="The temporal setting or chronological framework of the work. This could include historical periods, fictional timelines, or the sequence of events. Describe when the story takes place or unfolds.",
    )

    important_characters: Optional[List[str]] = Field(
        None,
        description="A list of the most significant characters, people, or entities mentioned in the work. Include protagonists, antagonists, and other key figures. For non-fiction, include important historical figures or people discussed.",
    )

Using Structured Output

The key to structured output is the detailed description argument on each Pydantic Field. These descriptions act as instructions to the AI model, telling it what information to extract and how to format it.

# Generate structured literary analysis using the Summary model
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength smart \
  --output data/country-structured-summary.json \
  --verbose

This produces a structured JSON output with the specific fields defined in the Summary model:

  • title: The work's title
  • author: Author information
  • date_of_publication: Publication date in YYYY-MM-DD format
  • plot: Comprehensive plot summary
  • timeline: Temporal setting and chronology
  • important_characters: List of key characters and figures

Creating Custom Models

You can create your own Pydantic models for different types of analysis:

Scientific Paper Analysis:

from typing import List, Optional

from pydantic import BaseModel, Field

class ScientificPaper(BaseModel):
    """Structured summary model for scientific papers."""
    title: Optional[str] = Field(None, description="The paper's title")
    authors: Optional[List[str]] = Field(None, description="List of author names")
    abstract: Optional[str] = Field(None, description="The paper's abstract or summary")
    methodology: Optional[str] = Field(None, description="Research methods used")
    key_findings: Optional[List[str]] = Field(None, description="Main research findings")
    conclusions: Optional[str] = Field(None, description="Authors' conclusions")

Business Document Analysis:

from typing import List, Optional

from pydantic import BaseModel, Field

class BusinessDocument(BaseModel):
    """Structured summary model for business documents."""
    document_type: Optional[str] = Field(None, description="Type of business document")
    key_metrics: Optional[List[str]] = Field(None, description="Important numbers or KPIs mentioned")
    action_items: Optional[List[str]] = Field(None, description="Tasks or actions to be taken")
    stakeholders: Optional[List[str]] = Field(None, description="People or organizations involved")
    deadlines: Optional[List[str]] = Field(None, description="Important dates or deadlines")

Best Practices for Structured Output

  1. Detailed Descriptions: Write clear, specific descriptions for each field
  2. Optional Fields: Use Optional for fields that might not always be present
  3. Appropriate Types: Use proper Python types (str, List[str], date, etc.)
  4. Output File Required: Always specify --output when using structured models
  5. Model Strength: Use smart, clever, or wise for better structured output quality

The structured output will look like:

{
  "summary": {
    "title": "The Country of the Blind",
    "author": "H.G. Wells",
    "date_of_publication": "1904-01-01",
    "plot": "A mountaineer named Nunez discovers an isolated valley...",
    "timeline": "Early 20th century, set in an isolated Andean valley...",
    "important_characters": ["Nunez", "Medina-saroté", "Yacob", "The Elders"]
  },
  "subsummary_count": 3,
  "levels_used": 2,
  "total_tokens": 1250,
  "source_file": "data/country-of-the-blind-phrases.jsonl",
  "model_used": "Summary"
}
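
Because the result is plain JSON, downstream code can load it with the standard library and convert the date string back into a `date` object. The sketch below works on an abbreviated copy of the example output; the helper name `parse_publication_date` is introduced here for illustration only.

```python
import json
from datetime import date
from typing import Optional


def parse_publication_date(value: Optional[str]) -> Optional[date]:
    """Convert a YYYY-MM-DD string to a date, passing None through unchanged."""
    return date.fromisoformat(value) if value else None


# Abbreviated copy of the example output; only a few fields are kept.
raw = """
{
  "summary": {
    "title": "The Country of the Blind",
    "author": "H.G. Wells",
    "date_of_publication": "1904-01-01"
  },
  "subsummary_count": 3,
  "levels_used": 2
}
"""

result = json.loads(raw)
summary = result["summary"]
published = parse_publication_date(summary.get("date_of_publication"))
print(summary["title"], published, result["subsummary_count"])
```

Using `summary.get(...)` rather than indexing keeps the code safe when an Optional field comes back as null or is missing.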

Advanced Model Strengths

Model strengths trade analysis quality against processing time:

# High-quality literary analysis with the wise model
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength wise \
  --output data/country-detailed-analysis.json

# Maximum sophistication for complex analysis
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength brainiac \
  --output data/comprehensive-analysis.json

Model Strength Guide

  • swift: Fast processing for quick results (good for testing)
  • smart: Intelligent analysis with good quality (recommended default)
  • clever: Advanced reasoning capabilities (better for complex texts)
  • wise: High-quality analysis for important content (slower but thorough)
  • brainiac: Maximum sophistication for complex tasks (slowest but highest quality)

Choose higher model strengths for:

  • Complex literary works
  • Academic or research content
  • Important business documents
  • When accuracy is more important than speed
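
When the CLI is driven from a script, it can help to validate the strength name before launching a long run. The Python sketch below builds the argument list for `bookwyrm summarize` using only the flags shown in the commands above. The helper name `summarize_command` and the rule that the two model flags must be given together are assumptions, not documented behavior.

```python
from typing import List, Optional

# Strength names as listed in the Model Strength Guide.
MODEL_STRENGTHS = ("swift", "smart", "clever", "wise", "brainiac")


def summarize_command(
    phrases_file: str,
    output: str,
    strength: str = "smart",
    model_file: Optional[str] = None,
    model_class: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a `bookwyrm summarize` call."""
    if strength not in MODEL_STRENGTHS:
        raise ValueError(f"unknown model strength {strength!r}; expected one of {MODEL_STRENGTHS}")
    # Assumption: a model file is only meaningful together with a class name.
    if (model_file is None) != (model_class is None):
        raise ValueError("model_file and model_class must be given together")
    cmd = ["bookwyrm", "summarize", phrases_file,
           "--model-strength", strength, "--output", output]
    if model_file is not None and model_class is not None:
        cmd += ["--model-class-file", model_file, "--model-class-name", model_class]
    return cmd


cmd = summarize_command(
    "data/country-of-the-blind-phrases.jsonl",
    "data/country-detailed-analysis.json",
    strength="wise",
    model_file="data/summary.py",
    model_class="Summary",
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to execute the command.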

7. Citation Finding

Finally, let's find specific citations related to life-threatening situations in the story:

# Find citations about life-threatening situations the protagonist faces
bookwyrm cite data/country-of-the-blind-phrases.jsonl \
  --question "Where does the protagonist experience life threatening situations?" \
  --output data/protagonist-dangers.json \
  --verbose \
  --long

This searches through the text chunks to find relevant passages that answer the specific question about dangerous situations.

Complete PDF Processing Workflow

Here's a complete workflow for processing a PDF from extraction to position queries:

# Step 1: Extract PDF structure
bookwyrm extract-pdf data/SOA_2025_Final.pdf --start-page 1 --num-pages 4 --output data/SOA_2025_Final_extracted.json

# Step 2: Convert to raw text with character mapping
bookwyrm pdf-to-text data/SOA_2025_Final_extracted.json --verbose

# Step 3: Query specific character ranges
bookwyrm pdf-query-range data/SOA_2025_Final_extracted_mapping.json 0 100 --verbose

# Step 4: Query a larger range and save results
bookwyrm pdf-query-range data/SOA_2025_Final_extracted_mapping.json 1000 2000 --output data/positions_1000-2000.json
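
Step 3 depends on the default file name that step 2 produces: the extraction file name with `_mapping.json` in place of `.json`. The Python sketch below derives those default names and assembles the first three steps as argument lists. The helper names `default_outputs` and `workflow` are hypothetical, and the naming rule is inferred from the examples in this tutorial rather than from a specification.

```python
from pathlib import Path
from typing import Dict, List


def default_outputs(extracted_json: str) -> Dict[str, str]:
    """Derive the default pdf-to-text output names from the extraction file name."""
    path = Path(extracted_json)
    return {
        "raw_text": path.with_name(f"{path.stem}_raw.txt").as_posix(),
        "mapping": path.with_name(f"{path.stem}_mapping.json").as_posix(),
    }


def workflow(pdf: str, extracted: str, start_page: int, num_pages: int,
             char_start: int, char_end: int) -> List[List[str]]:
    """Return the extract, convert, and query commands as argv lists."""
    mapping = default_outputs(extracted)["mapping"]
    return [
        ["bookwyrm", "extract-pdf", pdf, "--start-page", str(start_page),
         "--num-pages", str(num_pages), "--output", extracted],
        ["bookwyrm", "pdf-to-text", extracted, "--verbose"],
        ["bookwyrm", "pdf-query-range", mapping, str(char_start), str(char_end), "--verbose"],
    ]


steps = workflow("data/SOA_2025_Final.pdf", "data/SOA_2025_Final_extracted.json", 1, 4, 0, 100)
for step in steps:
    print(" ".join(step))
```

Each list can be run in order with `subprocess.run(step, check=True)`, so that a failure in one step stops the pipeline before the next step reads a missing file.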

Additional Options

Streaming Output

For real-time processing feedback, add the --stream flag to most commands:

# Stream the summarization process
bookwyrm summarize data/country-of-the-blind-phrases.jsonl --stream --verbose

Multiple Questions

You can ask multiple citation questions at once:

# Ask multiple questions about the story
bookwyrm cite data/country-of-the-blind-phrases.jsonl \
  --question "What are the main conflicts in the story?" \
  --question "How does the protagonist adapt to his environment?" \
  --question "What role does blindness play in the narrative?" \
  --verbose

Custom Prompts for Summarization

Instead of using Pydantic models, you can provide custom prompts:

# Use custom prompts for specialized analysis
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --chunk-prompt "Extract key themes, symbols, and literary devices from this text" \
  --summary-prompt "Create a comprehensive literary analysis focusing on themes, symbolism, and narrative techniques" \
  --model-strength clever \
  --output data/country-literary-analysis.json

Debug Mode

Use --include-debug to see intermediate summaries:

# Run with debug information to see intermediate summaries
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --include-debug \
  --output data/country-debug-summary.json

Expected Output Files

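After running the commands in sections 1 through 7, you should have at least the following files (the workflow and Additional Options examples write further ones):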

  • data/SOA_2025_Final_1-4.json - Structured PDF data with bounding boxes
  • data/SOA_2025_Final_1-4_raw.txt - Raw text extracted from PDF
  • data/SOA_2025_Final_1-4_mapping.json - Character position to bounding box mapping
  • data/character_positions.json - Query results for specific character ranges
  • data/country-of-the-blind-phrases.jsonl - Phrasal analysis
  • data/country-of-the-blind-summary.json - Basic text summary
  • data/country-structured-summary.json - Structured literary analysis using Summary model
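  • data/country-detailed-analysis.json - High-quality structured analysis (wise model strength)
  • data/comprehensive-analysis.json - Structured analysis at the brainiac model strength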
  • data/protagonist-dangers.json - Citation results

These files demonstrate the full pipeline from raw documents to structured insights using the BookWyrm CLI, including mapping text positions back to their original locations in the PDF.

Use Cases for Character Mapping

The character mapping functionality enables several powerful use cases:

  1. Citation Highlighting: Find citations in text, then highlight the exact regions in the original PDF
  2. Search Result Visualization: Show users exactly where search results appear in the document
  3. Annotation Systems: Allow users to annotate text and map annotations back to PDF coordinates
  4. Quality Analysis: Analyze OCR confidence scores for specific text regions
  5. Layout Analysis: Understand how text flows across pages and identify reading order