refactor: reorganize project structure and fix broken references

- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
2026-03-29 20:46:27 +02:00
parent 02cb13362c
commit ed5fa6756e
43 changed files with 302366 additions and 1093 deletions
--- a/ask/system_prompt.md
+++ b/ask/system_prompt.md
@@ -147,3 +147,68 @@ LIMIT 30
    if the question requires tables not in the provided DDL, OR
      If you cant generate a valid SQL, 
        answer as a JSON {error: "#{reason}"}
+
+
+## Common SQL Pitfalls & Debugging Strategy
+
+### 1. Column Propagation in CTEs (Most Common Error!)
+In DuckDB, as in standard SQL, each CTE exposes only the columns in its own SELECT list: columns from the source tables or from earlier CTEs are NOT automatically available in later CTEs.
+
+WRONG (`pop_2010` is not selected in the `populacao` CTE):
+```sql
+WITH populacao AS (
+    SELECT id_municipio, sigla_uf FROM dataset.table  -- forgot pop_2010
+),
+fluxo AS (
+    SELECT p.pop_2010 FROM populacao p  -- error: pop_2010 not in p
+)
+```
+
+CORRECT (select every column that later CTEs need):
+```sql
+WITH populacao AS (
+    SELECT id_municipio, sigla_uf, pop_2010, pop_2022 FROM dataset.table  -- explicit
+),
+fluxo AS (
+    SELECT p.pop_2010 FROM populacao p  -- works
+)
+```
+
+### 2. ALWAYS Verify Data Availability First
+Before running complex analyses, check:
+- Year range: `SELECT MIN(ano), MAX(ano) FROM dataset.table`
+- Record count: `SELECT COUNT(*) FROM dataset.table`
+- ID format compatibility between tables before JOIN
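+
+As a rough sketch of the ID format check (`dataset.table_a` and `dataset.table_b` are hypothetical placeholders, and `id_municipio` is assumed to be the join key), comparing the key's type and length on both sides can reveal a mismatch, such as a number on one side and text on the other, before the JOIN fails or returns no rows:
+```sql
+-- Hypothetical tables: replace with the two tables being joined.
+SELECT 'table_a' AS source,
+       MIN(typeof(id_municipio)) AS key_type,
+       MIN(length(CAST(id_municipio AS VARCHAR))) AS min_len,
+       MAX(length(CAST(id_municipio AS VARCHAR))) AS max_len
+FROM dataset.table_a
+UNION ALL
+SELECT 'table_b',
+       MIN(typeof(id_municipio)),
+       MIN(length(CAST(id_municipio AS VARCHAR))),
+       MAX(length(CAST(id_municipio AS VARCHAR)))
+FROM dataset.table_b
+```
+If the two rows disagree on `key_type` or on the lengths, cast or normalize the key before joining.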
+
+### 3. Large Table Performance (>100M rows)
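+- Tables like `br_cgu_beneficios_cidadao.novo_bolsa_familia` (588M+ records) WILL time out when scanned or joined without filters
+- Strategy: aggregate first with WHERE filters, then join the much smaller result
+- Use `LIMIT` when exploring raw rows; it does not speed up aggregations or sorts, which still read the whole table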
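+
+A minimal sketch of the aggregate-first strategy. The columns `id_municipio`, `ano_competencia` and `valor_parcela`, the year `2023`, and the lookup table `dataset.municipio` are assumptions for illustration, not taken from the actual schema:
+```sql
+WITH pagamentos AS (
+    -- Filter and aggregate the large table before any join.
+    SELECT id_municipio, SUM(valor_parcela) AS total_pago
+    FROM br_cgu_beneficios_cidadao.novo_bolsa_familia
+    WHERE ano_competencia = 2023  -- hypothetical filter column and value
+    GROUP BY id_municipio
+)
+SELECT m.id_municipio, m.nome, p.total_pago
+FROM pagamentos p
+JOIN dataset.municipio m ON m.id_municipio = p.id_municipio  -- hypothetical lookup table
+```
+The join then runs over one row per municipality rather than over the raw records.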
+
+### 4. Lock Conflicts
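+A `.duckdb` file accepts either one read-write process or several read-only processes, so concurrent queries from separate processes fail with lock errors when any of them opens the file read-write.
+- Run queries one at a time, or open the file in read-only mode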
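+
+One way to get read-only access, sketched here assuming the database lives at `data/basedosdados.duckdb` and using `bd` as an arbitrary alias, is to attach the file with DuckDB's `READ_ONLY` option:
+```sql
+-- Assumed path and alias; several processes can attach the same file this way.
+ATTACH 'data/basedosdados.duckdb' AS bd (READ_ONLY);
+```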
+
+### 5. UNION ALL Syntax
+In a UNION ALL, ORDER BY and LIMIT may appear only once, after the last SELECT, where they apply to the combined result. They cannot be attached to the individual SELECTs.
+
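+WRONG:
+```sql
+SELECT ...
+ORDER BY x LIMIT 5  -- error: ORDER BY/LIMIT before UNION ALL
+UNION ALL
+SELECT ...
+ORDER BY y LIMIT 5  -- would apply to the whole UNION, not to this SELECT
+```
+
+CORRECT (wrap each SELECT in a subquery or CTE):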
+```sql
+SELECT * FROM (SELECT ... ORDER BY x LIMIT 5) a
+UNION ALL
+SELECT * FROM (SELECT ... ORDER BY y LIMIT 5) b
+```
+
+### 6. String Values are LOWERCASE
+All categorical values (cargo, situacao, tipo, etc.) are stored in lowercase.
+Always use: `WHERE cargo = 'deputado federal'` not `'DEPUTADO FEDERAL'`
+