BookWyrm CLI Tutorial

This tutorial demonstrates the key capabilities of the BookWyrm CLI through practical examples using sample data files.

1. File Classification

First, let's classify a PDF file to understand its content type and structure:

# Classify the State-of-the-Art spacecraft technology (SOA) PDF to understand its content
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/SOA_2025_Final.pdf
bookwyrm classify --file data/SOA_2025_Final.pdf

This will analyze the PDF and return classification information including file type, content analysis, and structural metadata.

2. PDF Structure Extraction

Next, let's extract structured data from specific pages of the PDF:

# Extract structured JSON data from pages 1-4 of the SOA PDF
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/SOA_2025_Final.pdf
bookwyrm extract-pdf data/SOA_2025_Final.pdf --start-page 1 --num-pages 4 --output data/SOA_2025_Final_1-4.json

This creates a JSON file containing the structured text, bounding boxes, and layout information for the specified pages.

3. PDF to Text Conversion with Character Mapping

Convert the extracted PDF data to raw text with character position mapping:

# Convert PDF extraction to raw text with character mapping
bookwyrm pdf-to-text data/SOA_2025_Final_1-4.json --verbose

This creates two files:

data/SOA_2025_Final_1-4_raw.txt - Raw text with all PDF text elements joined by newlines
data/SOA_2025_Final_1-4_mapping.json - Character mapping that links each character position to its bounding box coordinates and page number
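
To make the relationship between the two files concrete, here is a minimal Python sketch of how joining text elements with newlines can yield a character-offset mapping. The element fields (`text`, `page`, `bbox`) and the span layout are assumptions made for illustration; the actual structure of the JSON files that BookWyrm writes may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical text elements; the real extraction JSON may use other field names.
ELEMENTS = [
    {"text": "State of the Art", "page": 1, "bbox": (72.0, 700.0, 240.0, 718.0)},
    {"text": "Small Spacecraft Technology", "page": 1, "bbox": (72.0, 670.0, 330.0, 688.0)},
]


def build_text_and_mapping(elements: List[Dict]) -> Tuple[str, List[Dict]]:
    """Join element texts with newlines and record each element's character span."""
    parts: List[str] = []
    spans: List[Dict] = []
    offset = 0
    for element in elements:
        text = element["text"]
        spans.append(
            {
                "start": offset,            # inclusive character offset
                "end": offset + len(text),  # exclusive character offset
                "page": element["page"],
                "bbox": element["bbox"],
            }
        )
        parts.append(text)
        offset += len(text) + 1  # +1 accounts for the newline separator
    return "\n".join(parts), spans


raw_text, mapping = build_text_and_mapping(ELEMENTS)
for span in mapping:
    print(span["start"], span["end"], repr(raw_text[span["start"]:span["end"]]))
```

Slicing the raw text with a span's offsets returns exactly the element that produced it, which is the property that lets a text position be traced back to a page and bounding box.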

You can also specify custom output filenames:

# Convert with custom output filenames
bookwyrm pdf-to-text data/SOA_2025_Final_1-4.json \
  --output data/SOA_2025_Final_1-4_text.txt \
  --mapping data/SOA_2025_Final_1-4_char_map.json \
  --verbose

4. Querying Character Positions

Query specific character ranges to get their bounding box coordinates:

# Query characters 974-1089 to see their positions and bounding boxes
bookwyrm pdf-query-range data/SOA_2025_Final_1-4_mapping.json 974 1089 --verbose

This shows you:

Which pages contain the specified character range
Bounding box coordinates for each character
OCR confidence scores
Sample text from the range
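
Conceptually, a range query selects every mapped span that overlaps the requested characters. The Python sketch below shows that interval-overlap logic on a hypothetical mapping (a list of spans with `start`, `end`, `page`, and `bbox`). It is not the implementation of `pdf-query-range`; the real mapping format and the inclusive or exclusive treatment of the end offset are assumptions here.

```python
from typing import Dict, List

# Hypothetical mapping spans: [start, end) character offsets with page and bounding box.
MAPPING = [
    {"start": 0, "end": 40, "page": 1, "bbox": (72.0, 700.0, 300.0, 712.0)},
    {"start": 41, "end": 95, "page": 1, "bbox": (72.0, 684.0, 310.0, 696.0)},
    {"start": 96, "end": 150, "page": 2, "bbox": (72.0, 700.0, 305.0, 712.0)},
]


def query_range(mapping: List[Dict], start: int, end: int) -> List[Dict]:
    """Return the spans that overlap the half-open character range [start, end)."""
    return [span for span in mapping if span["start"] < end and span["end"] > start]


def pages_for(spans: List[Dict]) -> List[int]:
    """Return the sorted page numbers touched by the given spans."""
    return sorted({span["page"] for span in spans})


hits = query_range(MAPPING, 90, 120)
pages = pages_for(hits)
print(pages)
```

A range that crosses a page boundary, as in this example, returns spans from both pages.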

Save the query results to a file:

# Save bounding box query results to JSON
bookwyrm pdf-query-range data/SOA_2025_Final_1-4_mapping.json 974 1089 \
  --output data/character_positions.json \
  --verbose

5. Phrasal Text Processing

Now let's process a text file to extract meaningful phrases and text spans:

# Create phrasal analysis of "The Country of the Blind" text
# Download from: https://github.com/scidonia/bookwyrm-client/blob/main/data/country-of-the-blind.txt
bookwyrm phrasal --file data/country-of-the-blind.txt --output data/country-of-the-blind-phrases.jsonl

This generates a JSONL file with text chunks and their positional information, suitable for further analysis.
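
If you want to inspect the JSONL file from Python, a minimal reader could look like the sketch below. JSONL stores one JSON object per line, so the file can be processed line by line. The record keys used in the sample (`text`, `start_char`, `end_char`) are hypothetical; check your own output file for the actual field names.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def read_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical records for demonstration only.
sample = [
    {"text": "First phrase.", "start_char": 0, "end_char": 13},
    {"text": "Second phrase.", "start_char": 14, "end_char": 28},
]

# Write the sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "phrases.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    records = list(read_jsonl(path))

print(len(records), "records")
```

To read the tutorial's real output, pass `Path("data/country-of-the-blind-phrases.jsonl")` to `read_jsonl` instead.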

6. Text Summarization

Let's create summaries from the phrasal data we just generated. BookWyrm supports both basic summarization and structured output using Pydantic models.

Basic Summarization

# Generate a basic summary from the Country of the Blind phrases
bookwyrm summarize data/country-of-the-blind-phrases.jsonl --output data/country-of-the-blind-summary.json --verbose

Structured Literary Analysis with Pydantic Models

BookWyrm supports structured output using custom Pydantic models, allowing you to extract specific information in a consistent format. This is particularly powerful for literary analysis, research, and data extraction tasks.

The Summary Model

The Summary model in data/summary.py demonstrates how to create a structured analysis model for literary works:

from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import date

class Summary(BaseModel):
    """Structured summary model for literary works."""

    title: Optional[str] = Field(
        None,
        description="The title of the literary work. Extract the exact title as it appears in the text, or infer it if clearly referenced.",
    )

    author: Optional[str] = Field(
        None,
        description="The author or authors of the literary work. Include full names when available, or partial names if that's all that's provided.",
    )

    date_of_publication: Optional[date] = Field(
        None,
        description="The publication date of the work in YYYY-MM-DD format. Use the earliest known publication date. If only a year is known, use January 1st of that year.",
    )

    plot: Optional[str] = Field(
        None,
        description="A comprehensive summary of the main plot, storyline, or narrative arc. Include key events, conflicts, and resolutions. For non-fiction, describe the main arguments or themes presented.",
    )

    timeline: Optional[str] = Field(
        None,
        description="The temporal setting or chronological framework of the work. This could include historical periods, fictional timelines, or the sequence of events. Describe when the story takes place or unfolds.",
    )

    important_characters: Optional[List[str]] = Field(
        None,
        description="A list of the most significant characters, people, or entities mentioned in the work. Include protagonists, antagonists, and other key figures. For non-fiction, include important historical figures or people discussed.",
    )
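
The `date_of_publication` description asks for January 1st when only a year is known. If you post-process results yourself, that rule can be sketched in plain Python as follows. The helper name `normalize_publication_date` is hypothetical and is not part of BookWyrm or `data/summary.py`.

```python
from datetime import date


def normalize_publication_date(value: str) -> date:
    """Parse 'YYYY-MM-DD', or fall back to January 1st for a bare 'YYYY'."""
    value = value.strip()
    if len(value) == 4 and value.isdigit():
        return date(int(value), 1, 1)
    # Raises ValueError if the string is not a valid ISO date.
    return date.fromisoformat(value)


print(normalize_publication_date("1904"))
print(normalize_publication_date("2025-03-15"))
```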

Using Structured Output

The key to structured output is the detailed description attached to each Pydantic Field. These descriptions act as instructions to the AI model, telling it exactly what information to extract and how to format it.

# Generate structured literary analysis using the Summary model
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength smart \
  --output data/country-structured-summary.json \
  --verbose

This produces a structured JSON output with the specific fields defined in the Summary model:

title: The work's title
author: Author information
date_of_publication: Publication date in YYYY-MM-DD format
plot: Comprehensive plot summary
timeline: Temporal setting and chronology
important_characters: List of key characters and figures

Creating Custom Models

You can create your own Pydantic models for different types of analysis:

Scientific Paper Analysis:

from typing import List, Optional

from pydantic import BaseModel, Field

class ScientificPaper(BaseModel):
    """Structured summary model for scientific papers."""

    title: Optional[str] = Field(None, description="The paper's title")
    authors: Optional[List[str]] = Field(None, description="List of author names")
    abstract: Optional[str] = Field(None, description="The paper's abstract or summary")
    methodology: Optional[str] = Field(None, description="Research methods used")
    key_findings: Optional[List[str]] = Field(None, description="Main research findings")
    conclusions: Optional[str] = Field(None, description="Authors' conclusions")

Business Document Analysis:

from typing import List, Optional

from pydantic import BaseModel, Field

class BusinessDocument(BaseModel):
    """Structured summary model for business documents."""

    document_type: Optional[str] = Field(None, description="Type of business document")
    key_metrics: Optional[List[str]] = Field(None, description="Important numbers or KPIs mentioned")
    action_items: Optional[List[str]] = Field(None, description="Tasks or actions to be taken")
    stakeholders: Optional[List[str]] = Field(None, description="People or organizations involved")
    deadlines: Optional[List[str]] = Field(None, description="Important dates or deadlines")

Best Practices for Structured Output

Detailed Descriptions: Write clear, specific descriptions for each field
Optional Fields: Use Optional for fields that might not always be present
Appropriate Types: Use proper Python types (str, List[str], date, etc.)
Output File Required: Always specify --output when using structured models
Model Strength: Use smart, clever, or wise for better structured output quality

The structured output will look like:

{
  "summary": {
    "title": "The Country of the Blind",
    "author": "H.G. Wells",
    "date_of_publication": "1904-01-01",
    "plot": "A mountaineer named Nunez discovers an isolated valley...",
    "timeline": "Early 20th century, set in an isolated Andean valley...",
    "important_characters": ["Nunez", "Medina-saroté", "Yacob", "The Elders"]
  },
  "subsummary_count": 3,
  "levels_used": 2,
  "total_tokens": 1250,
  "source_file": "data/country-of-the-blind-phrases.jsonl",
  "model_used": "Summary"
}
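
Because the result is ordinary JSON, it can be consumed with the Python standard library. The sketch below parses an abbreviated copy of the sample output shown above and converts the date string into a `date` object; in practice you would load the file written by `--output` rather than an inline string.

```python
import json
from datetime import date

# Abbreviated copy of the sample output shown above.
raw = """
{
  "summary": {
    "title": "The Country of the Blind",
    "author": "H.G. Wells",
    "date_of_publication": "1904-01-01",
    "important_characters": ["Nunez", "Medina-saroté", "Yacob", "The Elders"]
  },
  "subsummary_count": 3,
  "levels_used": 2,
  "model_used": "Summary"
}
"""

result = json.loads(raw)
summary = result["summary"]

# The model asks for YYYY-MM-DD, which date.fromisoformat parses directly.
published = date.fromisoformat(summary["date_of_publication"])

print(f'{summary["title"]} by {summary["author"]} ({published.year})')
print("Characters:", ", ".join(summary["important_characters"]))
```

Since every field in the `Summary` model is optional, real results may contain `null` values, so code that reads them should check for `None` before parsing.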

Advanced Model Strengths

Different model strengths provide varying levels of analysis quality and processing time:

# High-quality literary analysis with the wise model
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength wise \
  --output data/country-detailed-analysis.json

# Maximum sophistication for complex analysis
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --model-strength brainiac \
  --output data/comprehensive-analysis.json

Model Strength Guide

swift: Fast processing for quick results (good for testing)
smart: Intelligent analysis with good quality (recommended default)
clever: Advanced reasoning capabilities (better for complex texts)
wise: High-quality analysis for important content (slower but thorough)
brainiac: Maximum sophistication for complex tasks (slowest but highest quality)

Choose higher model strengths for:

Complex literary works
Academic or research content
Important business documents
Cases where accuracy matters more than speed
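
When scripting the CLI from Python, it can help to validate the strength name before launching a run. The sketch below builds an argument list using only the flags shown in this tutorial; the helper `summarize_command` is a hypothetical wrapper, not part of BookWyrm.

```python
import shlex
from typing import List

# Strength names as listed in the guide above.
MODEL_STRENGTHS = ("swift", "smart", "clever", "wise", "brainiac")


def summarize_command(phrases: str, output: str, strength: str = "smart") -> List[str]:
    """Build the argument list for `bookwyrm summarize` with a checked strength."""
    if strength not in MODEL_STRENGTHS:
        raise ValueError(f"unknown model strength: {strength!r}")
    return [
        "bookwyrm", "summarize", phrases,
        "--model-strength", strength,
        "--output", output,
    ]


cmd = summarize_command(
    "data/country-of-the-blind-phrases.jsonl",
    "data/country-of-the-blind-summary.json",
    strength="wise",
)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.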

7. Citation Finding

Finally, let's find specific citations related to life-threatening situations in the story:

# Find citations about life-threatening situations the protagonist faces
bookwyrm cite data/country-of-the-blind-phrases.jsonl \
  --question "Where does the protagonist experience life threatening situations?" \
  --output data/protagonist-dangers.json \
  --verbose \
  --long

This searches through the text chunks to find relevant passages that answer the specific question about dangerous situations.

Complete PDF Processing Workflow

Here's a complete workflow for processing a PDF from extraction to position queries:

# Step 1: Extract PDF structure
bookwyrm extract-pdf data/SOA_2025_Final.pdf --start-page 1 --num-pages 4 --output data/SOA_2025_Final_extracted.json

# Step 2: Convert to raw text with character mapping
bookwyrm pdf-to-text data/SOA_2025_Final_extracted.json --verbose

# Step 3: Query specific character ranges
bookwyrm pdf-query-range data/SOA_2025_Final_extracted_mapping.json 0 100 --verbose

# Step 4: Query a larger range and save results
bookwyrm pdf-query-range data/SOA_2025_Final_extracted_mapping.json 1000 2000 --output data/positions_1000-2000.json
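
Steps 2 and 3 rely on the default output names, which in this tutorial follow the pattern of appending `_raw.txt` and `_mapping.json` to the extraction file's base name. A small Python sketch of that convention, assuming it holds for any input name (the helper `default_outputs` is hypothetical):

```python
from pathlib import Path
from typing import Tuple


def default_outputs(extraction_json: str) -> Tuple[Path, Path]:
    """Derive the *_raw.txt and *_mapping.json names used in this tutorial."""
    source = Path(extraction_json)
    raw_text = source.with_name(f"{source.stem}_raw.txt")
    mapping = source.with_name(f"{source.stem}_mapping.json")
    return raw_text, mapping


raw_text, mapping = default_outputs("data/SOA_2025_Final_extracted.json")
print(raw_text.as_posix())
print(mapping.as_posix())
```

This is handy in scripts that chain the steps and need to know which mapping file to pass to `pdf-query-range`.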

Additional Options

Streaming Output

For real-time processing feedback, add the --stream flag to most commands:

# Stream the summarization process
bookwyrm summarize data/country-of-the-blind-phrases.jsonl --stream --verbose

Multiple Questions

You can ask multiple citation questions at once:

# Ask multiple questions about the story
bookwyrm cite data/country-of-the-blind-phrases.jsonl \
  --question "What are the main conflicts in the story?" \
  --question "How does the protagonist adapt to his environment?" \
  --question "What role does blindness play in the narrative?" \
  --verbose
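
If the questions live in a Python list, the repeated `--question` flag can be generated programmatically. This is a minimal sketch using only the flags shown above; `cite_command` is a hypothetical helper, not a BookWyrm function.

```python
import shlex
from typing import Iterable, List


def cite_command(phrases: str, questions: Iterable[str], verbose: bool = False) -> List[str]:
    """Build a `bookwyrm cite` argument list with one --question per entry."""
    cmd = ["bookwyrm", "cite", phrases]
    for question in questions:
        cmd += ["--question", question]
    if verbose:
        cmd.append("--verbose")
    return cmd


QUESTIONS = [
    "What are the main conflicts in the story?",
    "How does the protagonist adapt to his environment?",
]
cmd = cite_command("data/country-of-the-blind-phrases.jsonl", QUESTIONS, verbose=True)
print(shlex.join(cmd))
```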

Custom Prompts for Summarization

Instead of using Pydantic models, you can provide custom prompts:

# Use custom prompts for specialized analysis
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --chunk-prompt "Extract key themes, symbols, and literary devices from this text" \
  --summary-prompt "Create a comprehensive literary analysis focusing on themes, symbolism, and narrative techniques" \
  --model-strength clever \
  --output data/country-literary-analysis.json

Debug Mode

Use --include-debug to see intermediate summaries:

# Run with debug information to see intermediate summaries
bookwyrm summarize data/country-of-the-blind-phrases.jsonl \
  --model-class-file data/summary.py \
  --model-class-name Summary \
  --include-debug \
  --output data/country-debug-summary.json

Expected Output Files

After running the commands in sections 1-7, you should have at least the following files:

data/SOA_2025_Final_1-4.json - Structured PDF data with bounding boxes
data/SOA_2025_Final_1-4_raw.txt - Raw text extracted from PDF
data/SOA_2025_Final_1-4_mapping.json - Character position to bounding box mapping
data/character_positions.json - Query results for specific character ranges
data/country-of-the-blind-phrases.jsonl - Phrasal analysis
data/country-of-the-blind-summary.json - Basic text summary
data/country-structured-summary.json - Structured literary analysis using Summary model
data/country-detailed-analysis.json - High-quality structured analysis
data/protagonist-dangers.json - Citation results

These files demonstrate the full pipeline from raw documents to structured insights using the BookWyrm API, including the ability to map text positions back to their original locations in PDF documents.
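
To confirm that a run produced everything, a short Python check can report which of the listed files are missing. The file names are taken from the list above; the helper `missing_files` is a hypothetical convenience and assumes you run it from the directory that contains `data/`.

```python
from pathlib import Path
from typing import Iterable, List

# File names copied from the list above.
EXPECTED = [
    "data/SOA_2025_Final_1-4.json",
    "data/SOA_2025_Final_1-4_raw.txt",
    "data/SOA_2025_Final_1-4_mapping.json",
    "data/character_positions.json",
    "data/country-of-the-blind-phrases.jsonl",
    "data/country-of-the-blind-summary.json",
    "data/country-structured-summary.json",
    "data/country-detailed-analysis.json",
    "data/protagonist-dangers.json",
]


def missing_files(paths: Iterable[str], root: Path = Path(".")) -> List[str]:
    """Return the paths that do not exist as regular files under root."""
    return [p for p in paths if not (root / p).is_file()]


for name in missing_files(EXPECTED):
    print("missing:", name)
```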

Use Cases for Character Mapping

The character mapping functionality enables several powerful use cases:

Citation Highlighting: Find citations in text, then highlight the exact regions in the original PDF
Search Result Visualization: Show users exactly where search results appear in the document
Annotation Systems: Allow users to annotate text and map annotations back to PDF coordinates
Quality Analysis: Analyze OCR confidence scores for specific text regions
Layout Analysis: Understand how text flows across pages and identify reading order
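
As an illustration of the citation-highlighting use case, the sketch below merges the bounding boxes returned for a character range into one enclosing rectangle per page, which a PDF viewer could then draw as a highlight. The hit structure and the `(x0, y0, x1, y1)` box convention are assumptions; adapt them to the actual query output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed convention

# Hypothetical query hits: one entry per matched span.
HITS = [
    {"page": 1, "bbox": (72.0, 684.0, 310.0, 696.0)},
    {"page": 1, "bbox": (72.0, 670.0, 250.0, 682.0)},
    {"page": 2, "bbox": (72.0, 700.0, 305.0, 712.0)},
]


def highlight_regions(hits: List[Dict]) -> Dict[int, BBox]:
    """Merge the boxes on each page into one enclosing rectangle."""
    by_page: Dict[int, List[BBox]] = defaultdict(list)
    for hit in hits:
        by_page[hit["page"]].append(hit["bbox"])
    return {
        page: (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
        for page, boxes in by_page.items()
    }


regions = highlight_regions(HITS)
for page, box in sorted(regions.items()):
    print(page, box)
```

A single enclosing rectangle is the simplest choice; for multi-line passages, drawing one rectangle per line usually gives a tighter highlight.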