Design Decisions

This document explains the "why" behind key architectural and implementation choices in shh.

Architecture Decisions

Why Layered Architecture?

Decision: Separate CLI, Core, and Adapters layers with unidirectional dependencies.

Rationale:

  • Testability: Core logic can be tested without mocking external APIs
  • Flexibility: Could swap Typer for argparse, or add a web UI
  • Clarity: Clear separation of concerns makes code easier to understand
  • Maintenance: Changes to UI don't affect business logic

Trade-offs:

  • ✅ Easier to test and maintain
  • ✅ Framework-independent core
  • ⚠️ More boilerplate (but not much in this small app)

Alternatives considered:

  • Single-file script: Too messy for a production app
  • Full hexagonal architecture: Overkill for a CLI tool
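
To make the dependency direction concrete, here is a minimal sketch (module and interface names are illustrative, not shh's actual layout): Core declares a Protocol, an Adapter implements it, and only the CLI wires the two together.

from typing import Protocol

# core: declares what it needs, imports nothing external
class Transcriber(Protocol):
    async def transcribe(self, wav_path: str) -> str: ...

# adapter: satisfies the Core interface, owns the external API details
class WhisperTranscriber:
    async def transcribe(self, wav_path: str) -> str:
        ...  # call the Whisper API here

# cli: the only layer that imports both and wires them together
transcriber: Transcriber = WhisperTranscriber()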

Why Pragmatic vs. Pure Clean Architecture?

Decision: Orchestration happens in CLI layer, not Core layer.

Rationale:

  • shh is small (< 1000 lines of code)
  • Orchestration logic is tightly coupled to CLI UX
  • Moving orchestration to Core would add complexity without benefit

When to move to Core:

  • If we add a web UI (need shared orchestration)
  • If orchestration logic becomes complex
  • If we need to test orchestration without CLI

Current approach:

# CLI layer handles orchestration (pragmatic)
async def record_command(...):
    audio = await record_audio()
    text = await transcribe_audio(audio)
    formatted = await format_transcription(text)
    await copy_to_clipboard(formatted)

Pure approach (not used):

# Core layer handles orchestration (overkill for now)
class TranscriptionService:
    def __init__(self, whisper, llm, clipboard):
        self.whisper = whisper
        self.llm = llm
        self.clipboard = clipboard

    async def transcribe(self, audio):
        text = await self.whisper.transcribe(audio)
        formatted = await self.llm.format(text)
        await self.clipboard.copy(formatted)
        return formatted

Technology Choices

Why Typer over argparse?

Decision: Use Typer for CLI framework.

Rationale:

  • Type hints: Typer uses Python type hints for argument parsing
  • Automatic help: Generates beautiful help messages
  • Subcommands: Clean syntax for command groups (config show, config set)
  • Rich integration: Works seamlessly with Rich for terminal UI

Trade-offs:

  • ✅ Less boilerplate than argparse
  • ✅ Better UX (help messages, validation)
  • ⚠️ Extra dependency (but lightweight)

Alternatives considered:

  • argparse: Built-in but verbose and less ergonomic
  • click: Popular but less type-safe than Typer
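
A short sketch of the points above (option handling is illustrative; the `config set`/`config show` commands come from the rationale list):

import typer

app = typer.Typer()
config_app = typer.Typer()
app.add_typer(config_app, name="config")  # enables `shh config set`, `shh config show`

@config_app.command("set")
def config_set(key: str, value: str) -> None:
    """Type hints drive parsing; this docstring becomes the help text."""
    typer.echo(f"{key} = {value}")

if __name__ == "__main__":
    app()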

Why Rich for Terminal UI?

Decision: Use Rich for all terminal output.

Rationale:

  • Beautiful output: Colors, tables, panels, progress bars
  • Live updates: Real-time progress display while recording
  • Minimal effort: Simple API for complex formatting
  • Professional look: Makes the CLI feel polished

Examples:

from rich.console import Console
from rich.live import Live
from rich.panel import Panel
from rich.table import Table
from rich.text import Text

console = Console()

# Rich table
table = Table(title="Configuration")
table.add_column("Setting", style="cyan")
console.print(table)

# Rich panel
panel = Panel("Success!", title="Setup Complete", border_style="green")
console.print(panel)

# Rich live display (auto_refresh is off, so refresh explicitly on update)
with Live(auto_refresh=False) as live:
    live.update(Text("Recording... 12.3s"), refresh=True)

Alternatives considered:

  • Plain print(): Works but looks amateurish
  • colorama: Colors only, no tables or live updates
  • blessed: More complex API

Why PydanticAI over LangChain?

Decision: Use PydanticAI for LLM formatting.

Rationale:

  • Structured outputs: Pydantic models ensure valid responses
  • Type safety: Full type hints, integrates with mypy
  • Simplicity: Less complex than LangChain for this use case
  • Modern: Built by Pydantic team, first-class async support

Example:

from pydantic import BaseModel
from pydantic_ai import Agent

class FormattedTranscription(BaseModel):
    text: str

agent = Agent(model, output_type=FormattedTranscription)
result = await agent.run(f"Format this: {text}")
# result.output.text is guaranteed to be a string

Trade-offs:

  • ✅ Simpler than LangChain for our needs
  • ✅ Type-safe outputs
  • ⚠️ Newer library (less mature)

Alternatives considered:

  • LangChain: Too heavyweight, complex API
  • Direct OpenAI SDK: No typed, validated outputs without extra plumbing

Why sounddevice over pyaudio?

Decision: Use sounddevice for audio recording.

Rationale:

  • Cross-platform: Works on macOS, Linux, Windows
  • NumPy integration: Returns audio as NumPy arrays
  • Modern: Active development, Python 3+ focused
  • Simple API: Easy to use for basic recording

Example:

import sounddevice as sd

sample_rate = 16_000  # samples per second
duration = 5  # seconds

# Record audio into a NumPy array
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
sd.wait()  # Wait until recording is finished

Trade-offs:

  • ✅ Simpler than pyaudio
  • ✅ Better documentation
  • ⚠️ Requires PortAudio system library

Alternatives considered:

  • pyaudio: More complex, less Pythonic API
  • python-sounddevice: The same library (python-sounddevice is the project name for the sounddevice package)
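
Since shh records until the user presses Enter (see below) rather than for a fixed duration, the real recording loop is closer to a callback-driven stream. A sketch of that pattern (details simplified; shh's actual implementation may differ):

import numpy as np
import sounddevice as sd

chunks: list[np.ndarray] = []

def on_audio(indata, frames, time, status):
    chunks.append(indata.copy())  # accumulate chunks until the stream closes

with sd.InputStream(samplerate=16_000, channels=1, callback=on_audio):
    input("Recording... press Enter to stop\n")

audio = np.concatenate(chunks)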

Implementation Patterns

Why Async/Await?

Decision: Use async/await for all I/O operations.

Rationale:

  • Non-blocking: API calls don't freeze the UI
  • Responsive UX: Live progress updates while recording
  • Modern Python: async/await is the standard for I/O-bound tasks

Where async is used:

  • Recording audio (async context manager)
  • API calls (Whisper, GPT)
  • Clipboard operations
  • File I/O (when possible)

Bridge to sync:

import asyncio

# Typer callbacks are sync, bridge to async backend
def default_command(...):
    asyncio.run(record_command(...))

Trade-offs:

  • ✅ Better UX (non-blocking operations)
  • ✅ Modern Python patterns
  • ⚠️ Slightly more complex (async/await syntax)

Why Press Enter (not Ctrl+C) to Stop?

Decision: Use Enter key to stop recording, not Ctrl+C.

Rationale:

  • Intuitive: Enter is a natural "done" signal
  • No signal handling: Ctrl+C sends SIGINT, complicates error handling
  • Graceful shutdown: Enter allows clean async cancellation
  • User testing: Felt more natural than Ctrl+C

Implementation:

import asyncio
import sys

async def wait_for_enter() -> None:
    loop = asyncio.get_running_loop()
    # Run blocking sys.stdin.readline in the default thread pool
    await loop.run_in_executor(None, sys.stdin.readline)

async def record_until_enter() -> None:  # illustrative wrapper
    enter_task = asyncio.create_task(wait_for_enter())
    while not enter_task.done():
        # Continue recording / pushing live updates
        await asyncio.sleep(0.1)

Trade-offs:

  • ✅ Cleaner async code
  • ✅ More intuitive UX
  • ⚠️ Different from typical CLI tools (which use Ctrl+C)

Alternative considered:

  • Ctrl+C: More common but requires signal handling and is less graceful

Why Temporary Files for Audio?

Decision: Save audio to temporary WAV files, delete immediately after transcription.

Rationale:

  • Whisper API requirement: The API expects a file upload, not raw bytes
  • Disk space: Auto-cleanup prevents recordings from accumulating
  • Privacy: Recordings don't linger on disk after transcription

Implementation:

wav_path = save_audio_to_wav(audio_data)
try:
    text = await transcribe_audio(wav_path, api_key)
finally:
    wav_path.unlink(missing_ok=True)  # Always delete, even on error

Trade-offs:

  • ✅ Required by API
  • ✅ Auto-cleanup prevents disk bloat
  • ⚠️ Temporary I/O overhead (minimal)

Alternative considered:

  • Persistent files: Would require manual cleanup or config for storage location

Why Enum for TranscriptionStyle?

Decision: Use Python Enum for style choices.

Rationale:

  • Type safety: Can't pass invalid styles
  • IDE autocomplete: Enum members appear in autocomplete
  • Validation: Automatic validation in Pydantic models
  • Self-documenting: All valid values in one place

Example:

from enum import Enum

from pydantic import BaseModel

class TranscriptionStyle(str, Enum):
    NEUTRAL = "neutral"
    CASUAL = "casual"
    BUSINESS = "business"

# Type-safe usage
style = TranscriptionStyle.CASUAL

# Pydantic validates automatically
class Settings(BaseModel):
    default_style: TranscriptionStyle

Trade-offs:

  • ✅ Type-safe
  • ✅ Validated automatically
  • ⚠️ Slightly more verbose than plain strings

Alternative considered:

  • Plain strings: Easier but error-prone (typos not caught)

Why pydantic-settings for Configuration?

Decision: Use pydantic-settings for config management.

Rationale:

  • Type safety: Settings are typed and validated
  • Environment variables: Automatic parsing with SHH_ prefix
  • Multiple sources: Supports env vars, JSON files, defaults
  • Validation: Automatic validation on load
  • Platform-agnostic: Works across macOS, Linux, Windows

Example:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="SHH_")  # reads SHH_OPENAI_API_KEY

    openai_api_key: str = ""
    default_style: TranscriptionStyle = TranscriptionStyle.NEUTRAL

Trade-offs:

  • ✅ Type-safe configuration
  • ✅ Automatic validation
  • ⚠️ Extra dependency (but small)

Alternatives considered:

  • ConfigParser: Built-in but less type-safe
  • python-dotenv: Only handles .env files
  • YAML: Requires extra library, more complex

Testing Decisions

Why No E2E Tests?

Decision: No end-to-end tests with real APIs in CI.

Rationale:

  • Cost: Real API calls cost money (OpenAI charges per request)
  • Speed: API calls are slow, would make CI sluggish
  • Reliability: External APIs can fail, causing flaky tests
  • Coverage: Integration tests with mocks provide 80%+ coverage

What we test instead:

  • Unit tests: Core logic, isolated functions
  • Integration tests: Full pipeline with mocked APIs

Example:

from unittest.mock import AsyncMock, MagicMock, patch

# Integration test with mocked Whisper API
with patch("shh.adapters.whisper.client.AsyncOpenAI") as mock:
    mock_instance = mock.return_value
    mock_instance.audio.transcriptions.create = AsyncMock(
        return_value=MagicMock(text="Hello world")
    )

    result = await transcribe_audio(wav_path, "sk-test-key")
    assert result == "Hello world"

Trade-offs:

  • ✅ Fast CI (no API waits)
  • ✅ Free (no API costs)
  • ⚠️ Doesn't catch API changes (mitigated by integration tests)

When to add E2E:

  • If we see production bugs not caught by mocks
  • For release validation (not in regular CI)

Future Considerations

When to Add a Database?

Currently: Settings stored in JSON file.

Add a database when:

  • We need to store transcription history
  • We add user accounts or multi-user support
  • We need search or querying of past transcriptions

Likely choice: SQLite (local, no server needed)
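
If history storage lands, a minimal sketch using the standard-library sqlite3 module (the schema is purely illustrative; nothing like this exists in shh today):

import sqlite3

conn = sqlite3.connect("history.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS transcriptions (
        id INTEGER PRIMARY KEY,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        style TEXT,
        text TEXT
    )"""
)
conn.execute(
    "INSERT INTO transcriptions (style, text) VALUES (?, ?)",
    ("neutral", "Hello world"),
)
conn.commit()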


When to Add a Web UI?

Currently: CLI only.

Add a web UI when:

  • Users request it (not yet)
  • We need remote access (record on server, access from browser)

Architecture impact:

  • Move orchestration from CLI to Core layer
  • Add FastAPI or Flask adapter layer
  • Shared Core logic between CLI and Web
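
A hedged sketch of what that adapter layer might look like, reusing the TranscriptionService shape from the "pure approach" above (FastAPI chosen arbitrarily; all names hypothetical):

from fastapi import FastAPI, UploadFile

# Hypothetical Core service shared by the CLI and web adapters
class TranscriptionService:
    async def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError  # real orchestration would live in Core

service = TranscriptionService()
app = FastAPI()

@app.post("/transcribe")
async def transcribe(upload: UploadFile) -> dict[str, str]:
    # Same orchestration the CLI uses, behind an HTTP endpoint
    text = await service.transcribe(await upload.read())
    return {"text": text}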

When to Support Local Whisper?

Currently: OpenAI API only (requires internet and API key).

Add local Whisper when:

  • Users request offline support
  • Privacy concerns arise (local processing)

Implementation:

  • Add LocalWhisperAdapter implementing same interface
  • Configuration setting to choose API vs. Local
  • Download model on first run
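
A hedged sketch of such an adapter using the open-source whisper package (the interface shape and model choice are assumptions):

import asyncio
from pathlib import Path

import whisper  # the open-source openai-whisper package

class LocalWhisperAdapter:
    """Implements the same transcribe interface as the API-backed adapter."""

    def __init__(self, model_name: str = "base"):
        self.model = whisper.load_model(model_name)  # downloads on first use

    async def transcribe(self, wav_path: Path) -> str:
        # whisper is synchronous; run it off the event loop
        result = await asyncio.to_thread(self.model.transcribe, str(wav_path))
        return result["text"]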

Next Steps