From c93c473e38e687dd03e7a7f2b57cd57569c0dc39 Mon Sep 17 00:00:00 2001
From: rafapolo <rafael@extrapolo.com>
Date: Sun, 17 May 2026 11:24:38 +0200
Subject: [PATCH] Prefer remote query endpoint over local DuckDB in CLAUDE.md

Update conventions and commands to reflect that db.xn--2dk.xyz/query
is now the primary interface for SQL queries. Also fix haloy deploy flag
(-c instead of -f) and update auth.py description.
---
 CLAUDE.md | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..4ccf119
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,108 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+**baseldosdados** mirrors the [Base dos Dados](https://basedosdados.org) project — 533 public BigQuery tables exported as Parquet+zstd to Hetzner Object Storage (S3-compatible). DuckDB queries the data on-demand without local imports. An AI-powered TUI converts Portuguese natural language to SQL.
+
+## Commands
+
+### Rust (`ask/` and `dbquery/`)
+```bash
+cd ask && cargo build --release        # build TUI
+cd ask && cargo build                  # dev build
+cd ask && cargo test                   # run tests
+./ask/target/release/ask               # interactive TUI
+./ask/target/release/ask "Quantos municípios tem SP?"  # CLI mode
+cd dbquery && cargo build --release
+```
+
+### Python services
+```bash
+python auth.py                         # auth + query HTTP server on :8081
+python scripts/prepara_db.py          # generate DuckDB with views
+python scripts/gera_schemas.py        # extract table schemas → JSON/text
+```
+
+### DuckDB
+```bash
+duckdb data/basedosdados.duckdb       # interactive shell (requires S3 env vars)
+```
+
+### Data export pipeline
+```bash
+./scripts/roda.sh --dry-run           # estimate costs, no writes
+./scripts/roda.sh                     # run locally (needs gcloud + rclone)
+./scripts/roda.sh --gcloud-run        # spin up GCP VM and run there
+```
+
+### Querying data (preferred)
+```bash
+# Use the remote endpoint — collocated with S3, persistent connection, faster than local
+curl "https://db.xn--2dk.xyz/query?q=SELECT+..." -H "X-Password: $BASIC_AUTH_PASSWORD"
+# or POST for longer queries
+curl -X POST "https://db.xn--2dk.xyz/query" -H "X-Password: $BASIC_AUTH_PASSWORD" --data-raw "SELECT ..."
+```
+
+### Docker / deployment
+```bash
+docker build -t baseldosdados .       # multi-stage build (Rust + Python)
+haloy deploy -c haloy.yml             # deploy via haloy
+```
+
+## Architecture
+
+### Services (started by `start.sh`)
+| Port | Service | Purpose |
+|------|---------|---------|
+| 7681 | ttyd → duckdb | Browser-accessible DuckDB shell |
+| 7682 | ttyd → ask | Browser-accessible NL→SQL TUI |
+| 8081 | auth.py | Cookie auth + SQL execution proxy |
+| 8080 | Caddy | Public reverse proxy, forward auth |
+
+Caddy routes by hostname: `ask.xn--2dk.xyz` → port 7682, `db.xn--2dk.xyz` → port 7681. The `/query` endpoint on `db.xn--2dk.xyz` is unauthenticated for read-only SQL via HTTP.
+
+### `ask/` — Natural Language → SQL (Rust)
+- `src/main.rs` — TUI entry point, ratatui/crossterm event loop
+- `src/sql_generator.rs` — LLM backends: Gemini, OpenRouter, Ollama (sqlcoder)
+- `src/table_selector.rs` — semantic table selection from embeddings before prompting
+- `src/schema_filter.rs` — trims schema to relevant tables (controlled by `TOP_K_TABLES`)
+
+LLM backend is selected by `SQL_GENERATOR` env var (`gemini`/`openrouter`/`sqlcoder`). Schema metadata lives in `context/`.
+
+### `auth.py` — Auth & Query Service
+HMAC-SHA256 cookie auth. Holds a **persistent DuckDB Python connection** (in-memory + ATTACH read-only) initialized once at startup with S3 credentials and httpfs. Returns JSON. Use `X-Password` header matching `BASIC_AUTH_PASSWORD`.
+
+### Data flow
+BigQuery → Google Cloud Storage (Parquet) → Hetzner S3 (via `scripts/roda.sh` + rclone) → DuckDB httpfs reads on query.
+
+### `context/` — Schema metadata
+- `basedosdados-schema.json` — full schema (3.8 MB), used by `ask`
+- `schema_compact.txt` — text format for prompting
+- `table_embeddings.json` — semantic vectors for table selection (11.4 MB)
+- `join_keys.md` — foreign key relationships across datasets
+
+## Environment Variables
+
+| Variable | Used by | Purpose |
+|----------|---------|---------|
+| `SQL_GENERATOR` | ask | LLM backend (`gemini`/`openrouter`/`sqlcoder`) |
+| `GEMINI_API_KEY` | ask | Google Gemini API key |
+| `OPENROUTER_API_KEY` | ask | OpenRouter API key |
+| `TOP_K_TABLES` | ask | Tables passed to LLM (default: 5) |
+| `BASIC_AUTH_PASSWORD` | auth.py, Caddy | Web UI password |
+| `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` | auth.py, duckdb | Hetzner S3 credentials |
+| `HETZNER_S3_ENDPOINT` | auth.py, duckdb | S3 endpoint URL |
+| `BUCKET_REGION` | auth.py | S3 bucket region |
+| `OLLAMA_HOST` / `OLLAMA_MODEL` | ask | Local Ollama config |
+
+## Key Conventions
+
+- **Never use GCP, BigQuery, or `bq` CLI for queries** — all data access goes through DuckDB only.
+- **Prefer the remote endpoint** `https://db.xn--2dk.xyz/query` for all SQL queries — it is collocated with Hetzner S3 and has a persistent warmed connection. Local DuckDB is only a fallback when the server is unreachable.
+- DuckDB always runs read-only; no writes to the database from queries.
+- Queries on large tables must filter on partition columns (`ano`, `mes`, `sigla_uf`) — this is enforced in prompts.
+- SQL dialect is DuckDB; BigQuery syntax does not apply.
+- `overview/` contains per-dataset markdown summaries used as LLM context.
+- `queries/` contains example SQL and CNAE audit analysis files.