Files
baseldosdados/CLAUDE.md
rafapolo c93c473e38 Prefer remote query endpoint over local DuckDB in CLAUDE.md
Update conventions and commands to reflect that db.xn--2dk.xyz/query
is now the primary interface for SQL queries. Also fix haloy deploy flag
(-c instead of -f) and update auth.py description.
2026-05-17 11:24:38 +02:00

5.0 KiB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

baseldosdados mirrors the Base dos Dados project — 533 public BigQuery tables exported as Parquet+zstd to Hetzner Object Storage (S3-compatible). DuckDB queries the data on-demand without local imports. An AI-powered TUI converts Portuguese natural language to SQL.

Commands

Rust (ask/ and dbquery/)

cd ask && cargo build --release        # build TUI
cd ask && cargo build                  # dev build
cd ask && cargo test                   # run tests
./ask/target/release/ask               # interactive TUI
./ask/target/release/ask "Quantos municípios tem SP?"  # CLI mode
cd dbquery && cargo build --release

Python services

python auth.py                         # auth + query HTTP server on :8081
python scripts/prepara_db.py          # generate DuckDB with views
python scripts/gera_schemas.py        # extract table schemas → JSON/text

DuckDB

duckdb data/basedosdados.duckdb       # interactive shell (requires S3 env vars)

Data export pipeline

./scripts/roda.sh --dry-run           # estimate costs, no writes
./scripts/roda.sh                     # run locally (needs gcloud + rclone)
./scripts/roda.sh --gcloud-run        # spin up GCP VM and run there

Querying data (preferred)

# Use the remote endpoint — collocated with S3, persistent connection, faster than local
curl "https://db.xn--2dk.xyz/query?q=SELECT+..." -H "X-Password: $BASIC_AUTH_PASSWORD"
# or POST for longer queries
curl -X POST "https://db.xn--2dk.xyz/query" -H "X-Password: $BASIC_AUTH_PASSWORD" --data-raw "SELECT ..."

Docker / deployment

docker build -t baseldosdados .       # multi-stage build (Rust + Python)
haloy deploy -c haloy.yml             # deploy via haloy

Architecture

Services (started by start.sh)

Port Service Purpose
7681 ttyd → duckdb Browser-accessible DuckDB shell
7682 ttyd → ask Browser-accessible NL→SQL TUI
8081 auth.py Cookie auth + SQL execution proxy
8080 Caddy Public reverse proxy, forward auth

Caddy routes by hostname: ask.xn--2dk.xyz → port 7682, db.xn--2dk.xyz → port 7681. The /query endpoint on db.xn--2dk.xyz is unauthenticated for read-only SQL via HTTP.

ask/ — Natural Language → SQL (Rust)

  • src/main.rs — TUI entry point, ratatui/crossterm event loop
  • src/sql_generator.rs — LLM backends: Gemini, OpenRouter, Ollama (sqlcoder)
  • src/table_selector.rs — semantic table selection from embeddings before prompting
  • src/schema_filter.rs — trims schema to relevant tables (controlled by TOP_K_TABLES)

LLM backend is selected by SQL_GENERATOR env var (gemini/openrouter/sqlcoder). Schema metadata lives in context/.

auth.py — Auth & Query Service

HMAC-SHA256 cookie auth. Holds a persistent DuckDB Python connection (in-memory + ATTACH read-only) initialized once at startup with S3 credentials and httpfs. Returns JSON. Use X-Password header matching BASIC_AUTH_PASSWORD.

Data flow

BigQuery → Google Cloud Storage (Parquet) → Hetzner S3 (via scripts/roda.sh + rclone) → DuckDB httpfs reads on query.

context/ — Schema metadata

  • basedosdados-schema.json — full schema (3.8 MB), used by ask
  • schema_compact.txt — text format for prompting
  • table_embeddings.json — semantic vectors for table selection (11.4 MB)
  • join_keys.md — foreign key relationships across datasets

Environment Variables

Variable Used by Purpose
SQL_GENERATOR ask LLM backend (gemini/openrouter/sqlcoder)
GEMINI_API_KEY ask Google Gemini API key
OPENROUTER_API_KEY ask OpenRouter API key
TOP_K_TABLES ask Tables passed to LLM (default: 5)
BASIC_AUTH_PASSWORD auth.py, Caddy Web UI password
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY auth.py, duckdb Hetzner S3 credentials
HETZNER_S3_ENDPOINT auth.py, duckdb S3 endpoint URL
BUCKET_REGION auth.py S3 bucket region
OLLAMA_HOST / OLLAMA_MODEL ask Local Ollama config

Key Conventions

  • Never use GCP, BigQuery, or bq CLI for queries — all data access goes through DuckDB only.
  • Prefer the remote endpoint https://db.xn--2dk.xyz/query for all SQL queries — it is collocated with Hetzner S3 and has a persistent warmed connection. Local DuckDB is only a fallback when the server is unreachable.
  • DuckDB always runs read-only; no writes to the database from queries.
  • Queries on large tables must filter on partition columns (ano, mes, sigla_uf) — this is enforced in prompts.
  • SQL dialect is DuckDB; BigQuery syntax does not apply.
  • overview/ contains per-dataset markdown summaries used as LLM context.
  • queries/ contains example SQL and CNAE audit analysis files.