refactor: reorganize project structure and fix broken references
- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
.gitignore | 3 (vendored)
@@ -3,6 +3,5 @@
logs/
done_tables.txt
done_transfers.txt
# CocoIndex Code (ccc)
/.cocoindex_code/
**/target
*.log

Dockerfile

@@ -28,8 +28,9 @@ ENV PATH="/root/.cargo/bin:${PATH}" \

WORKDIR /app

COPY basedosdados.duckdb Caddyfile start.sh auth.py ask.py ./
RUN chmod +x start.sh
COPY data/basedosdados.duckdb shell/Caddyfile shell/auth.py start.sh ./
COPY ask/ask /app/ask
RUN chmod +x start.sh /app/ask

EXPOSE 8080
README.md | 127
@@ -8,11 +8,13 @@ Os dados foram exportados do BigQuery para o Hetzner Object Storage (Helsinki) n

## Consultando os dados

Acesso via browser ou curl, protegido por senha. Peça a senha para o administrador.
Acesso via browser ou curl, protegido por senha - peça!

### Shell no browser

Acesse **https://db.xn--2dk.xyz** → autentique → shell DuckDB interativo direto no browser.
Acesse **https://db.ミ.xyz** → autentique → shell DuckDB interativo direto no browser.

Use `.tables` para listar os datasets.

### SQL via curl

@@ -46,35 +48,6 @@ curl -s -X POST https://db.xn--2dk.xyz/query \
  --data-binary @query.sql > resultado.csv
```

### Descobrindo tabelas

```sql
-- listar todos os datasets (schemas)
SHOW SCHEMAS;

-- listar tabelas de um dataset
SHOW TABLES IN br_anatel_banda_larga_fixa;

-- ver colunas de uma tabela
DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
```

No shell do browser, `.tables` lista tudo de uma vez.

### Exportar em CSV ou JSON

O DuckDB permite formatar a saída diretamente na query:

```sql
-- CSV com header (pipe para arquivo via curl)
COPY (SELECT * FROM br_ibge_censo2022.municipios LIMIT 1000)
TO '/dev/stdout' (FORMAT csv, HEADER true);

-- JSON
SELECT * FROM br_ibge_censo2022.municipios LIMIT 10
FORMAT JSON;
```

---

## Exploração local

@@ -82,11 +55,11 @@ FORMAT JSON;

Para rodar as queries na sua própria máquina com DuckDB instalado:

```bash
python prepara_db.py  # gera basedosdados.duckdb com views apontando para o S3
duckdb basedosdados.duckdb
duckdb data/basedosdados.duckdb
```

As queries são executadas diretamente sobre os arquivos Parquet no S3 — não há download de dados. O DuckDB lê os arquivos remotos sob demanda via `httpfs`.
Precisa da credencial da .env - peça!

---

@@ -94,62 +67,52 @@ As queries são executadas diretamente sobre os arquivos Parquet no S3 — não
Interface TUI que permite fazer perguntas em português e obter SQL automaticamente.

### Arquitetura

```
Pergunta → [schema filtrado] → LLM local (sqlcoder) ou API externa
         → SQL
```

1. **Schema filtrado**: As tabelas relevantes são filtradas e enviadas ao LLM
2. **Geração SQL**: Modelo local (sqlcoder via Ollama) ou API externa (Gemini/OpenRouter)

### No browser

Acesse **https://ask.xn--2dk.xyz** → autentique → digite sua pergunta em português.
Acesse **https://ask.ミ.xyz** → autentique → digite sua pergunta em português.

### Local

```bash
# Compilar
cd ask
cargo build --release
./target/release/ask                                  # modo interativo
./target/release/ask "Quantos municípios tem SP?"     # modo CLI

# Modo interativo (TUI)
./target/release/ask

# Modo CLI
./target/release/ask "Quantos municípios tem SP?"
```

### Variáveis de ambiente

| Variável | Descrição |
|---|---|
| `GEMINI_API_KEY` | Chave da API Gemini (obrigatória para usar modelos Gemini) |
| `OPENROUTER_API_KEY` | Chave para usar modelos via OpenRouter |
| `GEMINI_MODEL` | Modelo a usar (padrão: `gemini-flash-latest`) |
| `SCHEMA_FILE` | Arquivo de schema (padrão: `context/schema_compact_inline.txt`) |
| `DB_FILE` | Arquivo DuckDB (padrão: `basedosdados.duckdb`) |

| Variável | Padrão | Descrição |
|---|---|---|
| `SQL_GENERATOR` | `gemini` | Generator: `sqlcoder`, `gemini`, ou `openrouter` |
| `GEMINI_API_KEY` | — | Chave API Gemini (obrigatória se usar gemini) |
| `OPENROUTER_API_KEY` | — | Chave API OpenRouter (obrigatória se usar openrouter) |
| `GEMINI_MODEL` | `gemini-flash-latest` | Modelo Gemini |
| `OPENROUTER_MODEL` | `openai/gpt-4o-mini` | Modelo OpenRouter |
| `OLLAMA_MODEL` | `sqlcoder` | Modelo Ollama (sqlcoder ou sqlcoder:14b) |
| `OLLAMA_HOST` | `http://localhost:11434` | Host Ollama |
| `TOP_K_TABLES` | `5` | Número de tabelas a selecionar |
| `SCHEMA_FILE` | `context/schema_compact_inline.txt` | Schema texto para fallback |
| `SCHEMA_JSON` | `context/basedosdados-schema.json` | Schema JSON completo |
| `DB_FILE` | `data/basedosdados.duckdb` | Arquivo DuckDB |

---
## Arquivos de schema

O diretório `context/` contém artefatos gerados automaticamente para contexto do LLM e descoberta de tabelas:

| Arquivo | Descrição |
|---|---|
| `schema_compact_inline.txt` | Schema condensado para contexto do LLM |
| `schema_compact.txt` | Schema mais verboso |
| `schema_ddl.sql` | DDL das views DuckDB |
| `join_graph.json` | Relacionamentos entre tabelas |
| `file_tree.md` | Estrutura de arquivos no S3 com tamanhos |
| `schemas.json` | Schema raw do BigQuery |

---

## Descobrindo tabelas

```sql
-- listar todos os datasets (schemas)
SHOW SCHEMAS;

-- listar tabelas de um dataset
SHOW TABLES IN br_anatel_banda_larga_fixa;

-- ver colunas de uma tabela
DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
```

No shell do browser, `.tables` lista tudo de uma vez. Para descoberta programática, use os arquivos em `context/`.

---

## Pipeline de exportação

@@ -172,8 +135,8 @@ Resume automático: se interrompido, basta rodar novamente.

| Script | Função |
|---|---|
| `roda.sh` | Pipeline principal de exportação |
| `prepara_db.py` | Gera `basedosdados.duckdb` com views para todas as tabelas |
| `scripts/roda.sh` | Pipeline principal de exportação |
| `scripts/prepara_db.py` | Gera `data/basedosdados.duckdb` com views para todas as tabelas |

### Configuração (`.env`)

@@ -196,10 +159,10 @@ Resume automático: se interrompido, basta rodar novamente.

### Executando

```bash
chmod +x roda.sh
./roda.sh --dry-run      # estima tamanho e custo
./roda.sh                # execução local
./roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
chmod +x scripts/roda.sh
./scripts/roda.sh --dry-run      # estima tamanho e custo
./scripts/roda.sh                # execução local
./scripts/roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
```

Autenticação GCP necessária antes da primeira exportação:

@@ -219,8 +182,8 @@ Cria uma VM `e2-standard-4` Debian 12 em `us-central1-a`, copia o script e o `.e

| `GCP_VM_NAME` | `bd-export-vm` | Nome da instância |
| `GCP_VM_ZONE` | `us-central1-a` | Zona do Compute Engine |

### Deploy do servidor
### Deploy do servidor para serviços de db e ask

```bash
haloy deploy
haloy deploy -f shell/haloy.yml
```
ask.py | 129
@@ -1,129 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
ask.py — Send a Portuguese question to Gemini and get back SQL.
|
||||
|
||||
Usage:
|
||||
python ask.py "Quantos pedidos foram feitos por cliente no último mês?"
|
||||
python ask.py "Qual a taxa de mortalidade infantil por município em 2020?"
|
||||
|
||||
Env vars:
|
||||
GEMINI_API_KEY — required
|
||||
SCHEMA_FILE — path to DDL file (default: context/schema_compact_inline.txt)
|
||||
GEMINI_MODEL — model slug (default: gemini-2.0-flash-latest)
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import requests
|
||||
import duckdb
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
SCHEMA_FILE = os.getenv("SCHEMA_FILE", "context/schema_compact_inline.txt")
|
||||
MODEL = os.getenv("GEMINI_MODEL", "gemini-flash-latest")
|
||||
DB_FILE = os.getenv("DB_FILE", "basedosdados.duckdb")
|
||||
|
||||
|
||||
def load_schema(path: str) -> str:
|
||||
with open(path, "r", encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
|
||||
def ask(question: str) -> str:
|
||||
api_key = os.getenv("GEMINI_API_KEY")
|
||||
if not api_key:
|
||||
sys.exit("Error: GEMINI_API_KEY not set")
|
||||
|
||||
schema_ddl = load_schema(SCHEMA_FILE)
|
||||
|
||||
system_prompt = (
|
||||
"You are a SQL expert for Base dos Dados (basedosdados.org), "
|
||||
"a Brazilian open data warehouse with tables accessed via DuckDB.\n\n"
|
||||
"Rules:\n"
|
||||
"- Use DuckDB syntax. Tables are referenced as dataset.table.\n"
|
||||
"- Only use columns from the provided DDL — never invent column names.\n"
|
||||
"- Add WHERE filters on ano, sigla_uf, or id_municipio whenever possible.\n"
|
||||
"- Return ONLY the SQL query, no explanation, no markdown fences.\n\n"
|
||||
f"Schema DDL:\n\n{schema_ddl}"
|
||||
)
|
||||
|
||||
url = (
|
||||
f"https://generativelanguage.googleapis.com/v1beta/models"
|
||||
f"/{MODEL}:generateContent"
|
||||
)
|
||||
|
||||
payload = {
|
||||
"system_instruction": {
|
||||
"parts": [{"text": system_prompt}]
|
||||
},
|
||||
"contents": [
|
||||
{
|
||||
"parts": [{"text": question}]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
response = requests.post(
|
||||
url,
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"X-goog-api-key": api_key,
|
||||
},
|
||||
data=json.dumps(payload),
|
||||
timeout=300,
|
||||
)
|
||||
|
||||
response.raise_for_status()
|
||||
result = response.json()
|
||||
|
||||
return result["candidates"][0]["content"]["parts"][0]["text"].strip()
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print(f"Usage: python {sys.argv[0]} \"<pergunta em português>\"", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
question = " ".join(sys.argv[1:])
|
||||
print(f"Question: {question}\n", file=sys.stderr)
|
||||
print(f"Model: {MODEL}\n", file=sys.stderr)
|
||||
|
||||
sql = ask(question)
|
||||
|
||||
print(f"\n── SQL ──────────────────────────────────────────\n{sql}\n", file=sys.stderr)
|
||||
|
||||
con = duckdb.connect(DB_FILE, read_only=True)
|
||||
rel = con.sql(sql)
|
||||
|
||||
# box mode: build borders from column names + data
|
||||
cols = rel.columns
|
||||
rows = rel.fetchall()
|
||||
|
||||
if not rows:
|
||||
print("(no rows returned)")
|
||||
return
|
||||
|
||||
col_widths = [len(c) for c in cols]
|
||||
for row in rows:
|
||||
for i, val in enumerate(row):
|
||||
col_widths[i] = max(col_widths[i], len(str(val) if val is not None else "NULL"))
|
||||
|
||||
def bar(left, mid, right, fill="─"):
|
||||
return left + mid.join(fill * (w + 2) for w in col_widths) + right
|
||||
|
||||
header = "│" + "│".join(f" {c:{w}} " for c, w in zip(cols, col_widths)) + "│"
|
||||
|
||||
print(bar("┌", "┬", "┐"))
|
||||
print(header)
|
||||
print(bar("├", "┼", "┤"))
|
||||
for row in rows:
|
||||
vals = [str(v) if v is not None else "NULL" for v in row]
|
||||
print("│" + "│".join(f" {v:{w}} " for v, w in zip(vals, col_widths)) + "│")
|
||||
print(bar("└", "┴", "┘"))
|
||||
print(f"\n{len(rows)} row(s)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
ask/.dockerignore | 1 (new file)

@@ -0,0 +1 @@
target
ask/Cargo.lock | 1 (generated)

@@ -252,6 +252,7 @@ dependencies = [
 "duckdb",
 "ratatui",
 "reqwest",
 "serde",
 "serde_json",
 "syntect",
 "tui-textarea",

ask/Cargo.toml

@@ -9,6 +9,7 @@ path = "src/main.rs"

[dependencies]
reqwest = { version = "0.12", features = ["blocking", "rustls-tls", "json"], default-features = false }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
duckdb = { version = "1", features = ["bundled"] }
dotenvy = "0.15"
ask/src/main.rs | 227
@@ -1,4 +1,9 @@
|
||||
mod schema_filter;
|
||||
mod sql_generator;
|
||||
mod table_selector;
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use chrono::Utc;
|
||||
use crossterm::{
|
||||
event::{
|
||||
DisableBracketedPaste, DisableMouseCapture, EnableBracketedPaste, EnableMouseCapture,
|
||||
@@ -9,14 +14,12 @@ use crossterm::{
|
||||
};
|
||||
use duckdb::Connection;
|
||||
use ratatui::{
|
||||
buffer::Buffer,
|
||||
layout::{Constraint, Direction, Layout, Rect},
|
||||
style::{Color, Modifier, Style},
|
||||
text::{Line, Span},
|
||||
widgets::{Block, Borders, Gauge, Paragraph, Row, Table, TableState, Wrap},
|
||||
Frame, Terminal,
|
||||
};
|
||||
use chrono::Utc;
|
||||
use serde_json::{json, Value};
|
||||
use std::{
|
||||
env, fs,
|
||||
@@ -43,6 +46,10 @@ struct Config {
|
||||
schema: String,
|
||||
db_file: String,
|
||||
prompt_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
}
|
||||
|
||||
enum Phase {
|
||||
@@ -234,10 +241,23 @@ fn spawn_worker(
|
||||
model: String,
|
||||
prompt_file: String,
|
||||
db_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
) -> mpsc::Receiver<WorkerMsg> {
|
||||
let (tx, rx) = mpsc::channel::<WorkerMsg>();
|
||||
std::thread::spawn(
|
||||
move || match ask_model(&question, &schema, &model, &prompt_file) {
|
||||
std::thread::spawn(move || {
|
||||
match ask_model_with_selection(
|
||||
&question,
|
||||
&schema,
|
||||
&model,
|
||||
&prompt_file,
|
||||
use_table_selection,
|
||||
&embeddings_file,
|
||||
&schema_json,
|
||||
similarity_threshold,
|
||||
) {
|
||||
Err(e) => {
|
||||
let err = format!("{:#}", e);
|
||||
log_question(&question, "", false, Some(&err));
|
||||
@@ -257,8 +277,8 @@ fn spawn_worker(
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
);
|
||||
}
|
||||
});
|
||||
rx
|
||||
}
|
||||
|
||||
@@ -270,6 +290,10 @@ fn spawn_retry_worker(
|
||||
model: String,
|
||||
prompt_file: String,
|
||||
db_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
) -> mpsc::Receiver<WorkerMsg> {
|
||||
let retry_q = format!(
|
||||
"{}\n\nO SQL que você gerou falhou com este erro DuckDB:\n```\n{}\n```\n\n\
|
||||
@@ -277,7 +301,17 @@ fn spawn_retry_worker(
|
||||
Corrija o SQL. Retorne APENAS o SQL corrigido, sem explicação.",
|
||||
question, error, failed_sql
|
||||
);
|
||||
spawn_worker(retry_q, schema, model, prompt_file, db_file)
|
||||
spawn_worker(
|
||||
retry_q,
|
||||
schema,
|
||||
model,
|
||||
prompt_file,
|
||||
db_file,
|
||||
use_table_selection,
|
||||
embeddings_file,
|
||||
schema_json,
|
||||
similarity_threshold,
|
||||
)
|
||||
}
|
||||
|
||||
// ── event handling ────────────────────────────────────────────────────────────
|
||||
@@ -327,6 +361,10 @@ impl App {
|
||||
self.config.model.clone(),
|
||||
self.config.prompt_file.clone(),
|
||||
self.config.db_file.clone(),
|
||||
self.config.use_table_selection,
|
||||
self.config.embeddings_file.clone(),
|
||||
self.config.schema_json.clone(),
|
||||
self.config.similarity_threshold,
|
||||
));
|
||||
}
|
||||
|
||||
@@ -398,6 +436,10 @@ impl App {
|
||||
self.config.model.clone(),
|
||||
self.config.prompt_file.clone(),
|
||||
self.config.db_file.clone(),
|
||||
self.config.use_table_selection,
|
||||
self.config.embeddings_file.clone(),
|
||||
self.config.schema_json.clone(),
|
||||
self.config.similarity_threshold,
|
||||
));
|
||||
self.last_sql.clear();
|
||||
} else {
|
||||
@@ -723,7 +765,12 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
let col_max_widths: Vec<usize> = (0..col_count)
|
||||
.map(|i| {
|
||||
let header_len = cols[i].len();
|
||||
let data_len = rows.iter().filter_map(|r| r.get(i)).map(|c| c.len()).max().unwrap_or(0);
|
||||
let data_len = rows
|
||||
.iter()
|
||||
.filter_map(|r| r.get(i))
|
||||
.map(|c| c.len())
|
||||
.max()
|
||||
.unwrap_or(0);
|
||||
(header_len.max(data_len)).max(min_col_width as usize)
|
||||
})
|
||||
.collect();
|
||||
@@ -732,16 +779,24 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
let use_wrap = total_needed > available_width as usize;
|
||||
|
||||
if use_wrap {
|
||||
let wrap_width = (available_width as usize / col_count).max(min_col_width as usize);
|
||||
let header_lines: Vec<Line> = cols.iter()
|
||||
let wrap_width =
|
||||
(available_width as usize / col_count).max(min_col_width as usize);
|
||||
let header_lines: Vec<Line> = cols
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, c)| {
|
||||
let wrapped = wrap_text(c, wrap_width);
|
||||
Line::from(wrapped)
|
||||
let spans: Vec<Span> =
|
||||
wrapped.into_iter().map(|s| Span::raw(s)).collect();
|
||||
Line::from(spans)
|
||||
})
|
||||
.collect();
|
||||
|
||||
let max_header_lines = header_lines.iter().map(|l| l.len()).max().unwrap_or(1);
|
||||
let max_header_lines = header_lines
|
||||
.iter()
|
||||
.map(|l| l.spans.len())
|
||||
.max()
|
||||
.unwrap_or(1);
|
||||
|
||||
let mut all_row_lines: Vec<Vec<Line>> = Vec::new();
|
||||
for row in rows {
|
||||
@@ -749,19 +804,19 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
.map(|i| {
|
||||
let cell = row.get(i).map(|s| s.as_str()).unwrap_or("");
|
||||
let wrapped = wrap_text(cell, wrap_width);
|
||||
Line::from(wrapped)
|
||||
let spans: Vec<Span> =
|
||||
wrapped.into_iter().map(|s| Span::raw(s)).collect();
|
||||
Line::from(spans)
|
||||
})
|
||||
.collect();
|
||||
let max_lines = row_lines.iter().map(|l| l.len()).max().unwrap_or(1);
|
||||
let max_lines = row_lines.iter().map(|l| l.spans.len()).max().unwrap_or(1);
|
||||
all_row_lines.push(row_lines);
|
||||
}
|
||||
|
||||
let selected_idx = table_state.selected().unwrap_or(0);
|
||||
let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n);
|
||||
|
||||
let block = Block::default()
|
||||
.borders(Borders::ALL)
|
||||
.title(table_title);
|
||||
let block = Block::default().borders(Borders::ALL).title(table_title);
|
||||
|
||||
let area = chunks[2];
|
||||
f.render_widget(block, area);
|
||||
@@ -778,29 +833,32 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
|
||||
let start_row = if n > visible_rows as usize {
|
||||
let scroll = selected_idx as i32 - visible_rows as i32 / 2;
|
||||
scroll.max(0) as usize.min(n.saturating_sub(visible_rows as usize))
|
||||
(scroll.max(0) as usize).min(n.saturating_sub(visible_rows as usize))
|
||||
} else {
|
||||
0
|
||||
};
|
||||
|
||||
let header_bg = Style::default().fg(Color::Yellow).add_modifier(Modifier::BOLD);
|
||||
let header_bg = Style::default()
|
||||
.fg(Color::Yellow)
|
||||
.add_modifier(Modifier::BOLD);
|
||||
for (col_idx, header_line) in header_lines.iter().enumerate() {
|
||||
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
|
||||
let col_width = wrap_width as u16;
|
||||
for (line_idx, line) in header_line.iter().enumerate() {
|
||||
for (line_idx, span) in header_line.spans.iter().enumerate() {
|
||||
let y = inner_area.y + line_idx as u16;
|
||||
if y >= inner_area.y + inner_area.height {
|
||||
break;
|
||||
}
|
||||
let spans: Vec<Span> = line.spans.iter().map(|s| {
|
||||
Span::styled(s.content.clone(), header_bg)
|
||||
}).collect();
|
||||
f.render_widget(Paragraph::new(Line::from(spans)), Rect {
|
||||
x: col_x,
|
||||
y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
});
|
||||
let styled_span = Span::styled(span.content.clone(), header_bg);
|
||||
f.render_widget(
|
||||
Paragraph::new(Line::from(styled_span)),
|
||||
Rect {
|
||||
x: col_x,
|
||||
y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
},
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -811,7 +869,9 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
}
|
||||
let is_selected = row_idx == selected_idx;
|
||||
let row_style = if is_selected {
|
||||
Style::default().bg(Color::DarkGray).add_modifier(Modifier::BOLD)
|
||||
Style::default()
|
||||
.bg(Color::DarkGray)
|
||||
.add_modifier(Modifier::BOLD)
|
||||
} else {
|
||||
Style::default()
|
||||
};
|
||||
@@ -820,20 +880,21 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
for (col_idx, cell_lines) in row_lines.iter().enumerate() {
|
||||
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
|
||||
let col_width = wrap_width as u16;
|
||||
for (line_idx, line) in cell_lines.iter().enumerate() {
|
||||
for (line_idx, span) in cell_lines.spans.iter().enumerate() {
|
||||
let cell_y = y + line_idx as u16;
|
||||
if cell_y >= inner_area.y + inner_area.height {
|
||||
break;
|
||||
}
|
||||
let spans: Vec<Span> = line.spans.iter().map(|s| {
|
||||
Span::styled(s.content.clone(), row_style)
|
||||
}).collect();
|
||||
f.render_widget(Paragraph::new(Line::from(spans)), Rect {
|
||||
x: col_x,
|
||||
y: cell_y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
});
|
||||
let styled_span = Span::styled(span.content.clone(), row_style);
|
||||
f.render_widget(
|
||||
Paragraph::new(Line::from(styled_span)),
|
||||
Rect {
|
||||
x: col_x,
|
||||
y: cell_y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
},
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -850,7 +911,8 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
}
|
||||
}
|
||||
} else {
|
||||
let col_widths: Vec<Constraint> = cols.iter()
|
||||
let col_widths: Vec<Constraint> = cols
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, _)| {
|
||||
let w = col_max_widths[i] as u16;
|
||||
@@ -1008,6 +1070,55 @@ fn ask_model(question: &str, schema: &str, model: &str, prompt_file: &str) -> Re
|
||||
Ok(ensure_sql(&sql))
|
||||
}
|
||||
|
||||
fn ask_model_with_selection(
|
||||
question: &str,
|
||||
_full_schema: &str,
|
||||
model: &str,
|
||||
prompt_file: &str,
|
||||
use_selection: bool,
|
||||
embeddings_file: &str,
|
||||
schema_json: &str,
|
||||
similarity_threshold: f32,
|
||||
) -> Result<String> {
|
||||
let prompt_template = fs::read_to_string(prompt_file)
|
||||
.with_context(|| format!("Não foi possível ler o prompt: {}", prompt_file))?;
|
||||
|
||||
let (schema_to_use, selected_tables) = if use_selection {
|
||||
match table_selector::select_tables_from_question(
|
||||
question,
|
||||
embeddings_file,
|
||||
similarity_threshold,
|
||||
) {
|
||||
Ok(table_ids) => {
|
||||
eprintln!(
|
||||
"=> Selecionadas {} tables relevantes: {:?}",
|
||||
table_ids.len(),
|
||||
table_ids
|
||||
);
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
let filtered_schema = schema_filter.filter_tables(&table_ids);
|
||||
(filtered_schema, Some(table_ids))
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!(
|
||||
"=> Aviso: falha na seleção de tables ({}), usando schema completo",
|
||||
e
|
||||
);
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
(schema_filter.full_schema_text(), None)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
(schema_filter.full_schema_text(), None)
|
||||
};
|
||||
|
||||
let generator = sql_generator::create_sql_generator()?;
|
||||
let sql = generator.generate(question, &schema_to_use, &prompt_template)?;
|
||||
|
||||
Ok(ensure_sql(&sql))
|
||||
}
|
||||
|
||||
fn ask_gemini(question: &str, system_prompt: &str, model: &str) -> Result<String> {
|
||||
let key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY não definida")?;
|
||||
let url = format!(
|
||||
@@ -1309,6 +1420,12 @@ VARIÁVEIS DE AMBIENTE
|
||||
OPENROUTER_API_KEY necessária para modelos OpenRouter
|
||||
GEMINI_MODEL modelo padrão (sobrescrito por --model)
|
||||
SCHEMA_FILE DDL do schema [context/schema_compact_inline.txt]
|
||||
SCHEMA_JSON full schema JSON [context/basedosdados-schema.json]
|
||||
EMBEDDINGS_FILE table embeddings [context/table_embeddings.json]
|
||||
TOP_K_TABLES número de tables a selecionar [5]
|
||||
SQL_GENERATOR sql generator: sqlcoder|gemini|openrouter [gemini]
|
||||
OLLAMA_MODEL modelo ollama [sqlcoder]
|
||||
OLLAMA_HOST host ollama [http://localhost:11434]
|
||||
PROMPT_FILE prompt do sistema [ask/system_prompt.md]
|
||||
DB_FILE banco DuckDB [basedosdados.duckdb]
|
||||
"#
|
||||
@@ -1321,7 +1438,18 @@ VARIÁVEIS DE AMBIENTE
|
||||
});
|
||||
let schema_file =
|
||||
env::var("SCHEMA_FILE").unwrap_or_else(|_| "context/schema_compact_inline.txt".into());
|
||||
let db_file = env::var("DB_FILE").unwrap_or_else(|_| "basedosdados.duckdb".into());
|
||||
let schema_json =
|
||||
env::var("SCHEMA_JSON").unwrap_or_else(|_| "context/basedosdados-schema.json".into());
|
||||
let embeddings_file =
|
||||
env::var("EMBEDDINGS_FILE").unwrap_or_else(|_| "context/table_embeddings.json".into());
|
||||
let similarity_threshold = env::var("SIMILARITY_THRESHOLD")
|
||||
.ok()
|
||||
.and_then(|v| v.parse().ok())
|
||||
.unwrap_or(0.35);
|
||||
let use_table_selection = env::var("USE_TABLE_SELECTION")
|
||||
.map(|v| v != "false" && v != "0")
|
||||
.unwrap_or(true);
|
||||
let db_file = env::var("DB_FILE").unwrap_or_else(|_| "data/basedosdados.duckdb".into());
|
||||
let prompt_file = env::var("PROMPT_FILE").unwrap_or_else(|_| "ask/system_prompt.md".into());
|
||||
let schema = fs::read_to_string(&schema_file)
|
||||
.with_context(|| format!("Não foi possível ler o schema: {}", schema_file))?;
|
||||
@@ -1333,6 +1461,10 @@ VARIÁVEIS DE AMBIENTE
|
||||
schema,
|
||||
db_file,
|
||||
prompt_file,
|
||||
use_table_selection,
|
||||
embeddings_file,
|
||||
schema_json,
|
||||
similarity_threshold,
|
||||
});
|
||||
}
|
||||
|
||||
@@ -1341,7 +1473,16 @@ VARIÁVEIS DE AMBIENTE
|
||||
eprintln!("\nModel: {}\nPergunta: {}\n", model, question);
|
||||
|
||||
let t0 = Instant::now();
|
||||
let sql = ask_model(&question, &schema, &model, &prompt_file)?;
|
||||
let sql = ask_model_with_selection(
|
||||
&question,
|
||||
&schema,
|
||||
&model,
|
||||
&prompt_file,
|
||||
use_table_selection,
|
||||
&embeddings_file,
|
||||
&schema_json,
|
||||
similarity_threshold,
|
||||
)?;
|
||||
eprintln!("=> SQL gerado em {}", fmt_duration(t0.elapsed()));
|
||||
print_sql_box(&sql);
|
||||
|
||||
|
||||
ask/src/schema_filter.rs | 135 (new file)
@@ -0,0 +1,135 @@
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::collections::HashSet;
|
||||
use std::fs;
|
||||
use std::path::Path;
|
||||
|
||||
#[derive(Debug, Clone, Deserialize, Serialize)]
|
||||
pub struct Column {
|
||||
pub name: String,
|
||||
#[serde(rename = "type")]
|
||||
pub col_type: String,
|
||||
pub description: Option<String>,
|
||||
}
|
||||
|
||||
pub type TableColumns = Vec<Column>;
|
||||
|
||||
#[derive(Debug, Clone, Deserialize)]
|
||||
pub struct FullSchema {
|
||||
#[serde(flatten)]
|
||||
pub datasets:
|
||||
std::collections::HashMap<String, std::collections::HashMap<String, TableColumns>>,
|
||||
}
|
||||
|
||||
pub struct SchemaFilter {
|
||||
schema: FullSchema,
|
||||
}
|
||||
|
||||
impl SchemaFilter {
|
||||
pub fn new<P: AsRef<Path>>(schema_path: P) -> anyhow::Result<Self> {
|
||||
let content = fs::read_to_string(schema_path)?;
|
||||
let schema: FullSchema = serde_json::from_str(&content)?;
|
||||
Ok(Self { schema })
|
||||
}
|
||||
|
||||
pub fn filter_tables(&self, table_ids: &[String]) -> String {
|
||||
let selected: HashSet<String> = table_ids.iter().cloned().collect();
|
||||
let mut lines = Vec::new();
|
||||
|
||||
lines.push("# Base dos Dados — Filtered Schema".to_string());
|
||||
lines.push(
|
||||
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
|
||||
.to_string(),
|
||||
);
|
||||
lines.push("# Format: dataset.table: col:TYPE description".to_string());
|
||||
lines.push(String::new());
|
||||
|
||||
for (dataset, tables) in &self.schema.datasets {
|
||||
for (table, columns) in tables {
|
||||
let full_id = format!("{}.{}", dataset, table);
|
||||
if selected.contains(&full_id) {
|
||||
let col_str = columns
|
||||
.iter()
|
||||
.map(|c| {
|
||||
let desc = c.description.as_deref().unwrap_or("");
|
||||
if desc.is_empty() {
|
||||
format!("{}:{}", c.name, type_abbrev(&c.col_type))
|
||||
} else {
|
||||
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
|
||||
}
|
||||
})
|
||||
.collect::<Vec<_>>()
|
||||
.join(" ");
|
||||
|
||||
lines.push(format!("{}: {}", full_id, col_str));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
lines.join("\n")
|
||||
}
|
||||
|
||||
pub fn full_schema_text(&self) -> String {
|
||||
let mut lines = Vec::new();
|
||||
|
||||
lines.push("# Base dos Dados — Full Schema".to_string());
|
||||
lines.push(
|
||||
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
|
||||
.to_string(),
|
||||
);
|
||||
lines.push("# Format: dataset.table: col:TYPE description".to_string());
|
||||
lines.push(String::new());
|
||||
|
||||
for (dataset, tables) in &self.schema.datasets {
|
||||
for (table, columns) in tables {
|
||||
let full_id = format!("{}.{}", dataset, table);
|
||||
let col_str = columns
|
||||
.iter()
|
||||
.map(|c| {
|
||||
let desc = c.description.as_deref().unwrap_or("");
|
||||
if desc.is_empty() {
|
||||
format!("{}:{}", c.name, type_abbrev(&c.col_type))
|
||||
} else {
|
||||
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
|
||||
}
|
||||
})
|
||||
.collect::<Vec<_>>()
|
||||
.join(" ");
|
||||
|
||||
lines.push(format!("{}: {}", full_id, col_str));
|
||||
}
|
||||
}
|
||||
|
||||
lines.join("\n")
|
||||
}
|
||||
|
||||
pub fn dataset_count(&self) -> usize {
|
||||
self.schema.datasets.len()
|
||||
}
|
||||
|
||||
pub fn table_count(&self) -> usize {
|
||||
self.schema.datasets.values().map(|t| t.len()).sum()
|
||||
}
|
||||
}
|
||||
|
||||
fn type_abbrev(full_type: &str) -> String {
|
||||
let upper = full_type.to_uppercase();
|
||||
if upper.contains("VARCHAR") || upper.contains("STRING") {
|
||||
"V".to_string()
|
||||
} else if upper.contains("INT") {
|
||||
"I".to_string()
|
||||
} else if upper.contains("DOUBLE") || upper.contains("FLOAT") {
|
||||
"D".to_string()
|
||||
} else if upper.contains("DATE") && !upper.contains("TIMESTAMP") {
|
||||
"Dt".to_string()
|
||||
} else if upper.contains("TIMESTAMP") {
|
||||
"Ts".to_string()
|
||||
} else if upper.contains("TIME") {
|
||||
"Ti".to_string()
|
||||
} else if upper.contains("BOOLEAN") {
|
||||
"B".to_string()
|
||||
} else if upper.contains("DECIMAL") {
|
||||
"Dec".to_string()
|
||||
} else {
|
||||
full_type.to_string()
|
||||
}
|
||||
}
|
||||
ask/src/sql_generator.rs | 207 (new file)
@@ -0,0 +1,207 @@
|
||||
use anyhow::{Context, Result};
|
||||
use serde_json::Value;
|
||||
use std::env;
|
||||
|
||||
pub trait SqlGenerator: Send + Sync {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String>;
|
||||
}
|
||||
|
||||
pub fn create_sql_generator() -> Result<Box<dyn SqlGenerator>> {
|
||||
let generator_type = env::var("SQL_GENERATOR").unwrap_or_else(|_| "gemini".to_string());
|
||||
|
||||
match generator_type.as_str() {
|
||||
"sqlcoder" => Ok(Box::new(SqlCoderGenerator::new()?)),
|
||||
"openrouter" => Ok(Box::new(OpenRouterGenerator::new()?)),
|
||||
"gemini" => Ok(Box::new(GeminiGenerator::new()?)),
|
||||
_ => anyhow::bail!(
|
||||
"Unknown SQL_GENERATOR: {}. Use: sqlcoder, gemini, or openrouter",
|
||||
generator_type
|
||||
),
|
||||
}
|
||||
}
|
||||
|
||||
pub struct GeminiGenerator {
|
||||
model: String,
|
||||
api_key: String,
|
||||
}
|
||||
|
||||
impl GeminiGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model = env::var("GEMINI_MODEL").unwrap_or_else(|_| "gemini-flash-latest".to_string());
|
||||
let api_key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY not defined")?;
|
||||
Ok(Self { model, api_key })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for GeminiGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = format!(
|
||||
"https://generativelanguage.googleapis.com/v1beta/models/{}:generateContent",
|
||||
self.model
|
||||
);
|
||||
|
||||
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"system_instruction": { "parts": [{ "text": system_prompt }] },
|
||||
"contents": [{ "parts": [{ "text": question }] }]
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(&url)
|
||||
.header("Content-Type", "application/json")
|
||||
.header("X-goog-api-key", &self.api_key)
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("Gemini HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse Gemini response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("Gemini API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["candidates"][0]["content"]["parts"][0]["text"]
|
||||
.as_str()
|
||||
.context("Unexpected Gemini response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
pub struct OpenRouterGenerator {
|
||||
model: String,
|
||||
api_key: String,
|
||||
}
|
||||
|
||||
impl OpenRouterGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model =
|
||||
env::var("OPENROUTER_MODEL").unwrap_or_else(|_| "openai/gpt-4o-mini".to_string());
|
||||
let api_key = env::var("OPENROUTER_API_KEY").context("OPENROUTER_API_KEY not defined")?;
|
||||
Ok(Self { model, api_key })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for OpenRouterGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = "https://openrouter.ai/api/v1/chat/completions";
|
||||
|
||||
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"model": self.model,
|
||||
"messages": [
|
||||
{ "role": "system", "content": system_prompt },
|
||||
{ "role": "user", "content": question }
|
||||
]
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(url)
|
||||
.header("Content-Type", "application/json")
|
||||
.header("Authorization", format!("Bearer {}", self.api_key))
|
||||
.header("HTTP-Referer", "https://basedosdados.org")
|
||||
.header("X-Title", "Base dos Dados Ask")
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("OpenRouter HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse OpenRouter response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("OpenRouter API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["choices"][0]["message"]["content"]
|
||||
.as_str()
|
||||
.context("Unexpected OpenRouter response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
pub struct SqlCoderGenerator {
|
||||
model: String,
|
||||
host: String,
|
||||
}
|
||||
|
||||
impl SqlCoderGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model = env::var("OLLAMA_MODEL").unwrap_or_else(|_| "sqlcoder".to_string());
|
||||
let host = env::var("OLLAMA_HOST").unwrap_or_else(|_| "http://localhost:11434".to_string());
|
||||
Ok(Self { model, host })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for SqlCoderGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = format!("{}/api/generate", self.host);
|
||||
|
||||
let full_prompt = format!(
|
||||
"{}\n\nSchema DDL:\n\n{}\n\nQuestion: {}\n\nSQL:",
|
||||
prompt_template.trim(),
|
||||
schema,
|
||||
question
|
||||
);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"model": self.model,
|
||||
"prompt": full_prompt,
|
||||
"stream": false
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(&url)
|
||||
.header("Content-Type", "application/json")
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("Ollama HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse Ollama response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("Ollama API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["response"]
|
||||
.as_str()
|
||||
.context("Unexpected Ollama response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
fn strip_fences(text: &str) -> String {
|
||||
let text = text.trim();
|
||||
if text.starts_with("```sql") {
|
||||
let end = text.find("```").unwrap_or(text.len());
|
||||
text[5..end].trim().to_string()
|
||||
} else if text.starts_with("```") {
|
||||
let end = text[3..].find("```").map(|i| i + 3).unwrap_or(text.len());
|
||||
text[3..end].trim().to_string()
|
||||
} else {
|
||||
text.to_string()
|
||||
}
|
||||
}
|
||||
ask/src/table_selector.rs | 146 (new file)
@@ -0,0 +1,146 @@
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::fs;
|
||||
use std::path::Path;
|
||||
|
||||
const DEFAULT_SIMILARITY_THRESHOLD: f32 = 0.35;
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct TableEmbedding {
|
||||
pub id: String,
|
||||
pub text: String,
|
||||
pub embedding: Vec<f32>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct EmbeddingsIndex {
|
||||
pub tables: Vec<TableEmbedding>,
|
||||
pub model: String,
|
||||
}
|
||||
|
||||
pub struct TableSelector {
|
||||
tables: Vec<TableEmbedding>,
|
||||
threshold: f32,
|
||||
}
|
||||
|
||||
impl TableSelector {
|
||||
pub fn new<P: AsRef<Path>>(embeddings_path: P, threshold: f32) -> anyhow::Result<Self> {
|
||||
let content = fs::read_to_string(embeddings_path)?;
|
||||
let index: EmbeddingsIndex = serde_json::from_str(&content)?;
|
||||
Ok(Self {
|
||||
tables: index.tables,
|
||||
threshold,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn select_tables(
|
||||
&self,
|
||||
question: &str,
|
||||
model: &dyn QuestionEmbedder,
|
||||
) -> anyhow::Result<Vec<String>> {
|
||||
let question_embedding = model.embed(question)?;
|
||||
|
||||
let mut similarities: Vec<(usize, f32)> = self
|
||||
.tables
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, table)| {
|
||||
let sim = cosine_similarity(&question_embedding, &table.embedding);
|
||||
(i, sim)
|
||||
})
|
||||
.collect();
|
||||
|
||||
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||||
|
||||
let selected: Vec<String> = similarities
|
||||
.into_iter()
|
||||
.filter(|(_, sim)| *sim >= self.threshold)
|
||||
.map(|(i, sim)| {
|
||||
eprintln!(" {} (similarity: {:.3})", self.tables[i].id, sim);
|
||||
self.tables[i].id.clone()
|
||||
})
|
||||
.collect();
|
||||
|
||||
Ok(selected)
|
||||
}
|
||||
|
||||
pub fn get_table_texts(&self, table_ids: &[String]) -> Vec<String> {
|
||||
table_ids
|
||||
.iter()
|
||||
.filter_map(|id| self.tables.iter().find(|t| &t.id == id))
|
||||
.map(|t| t.text.clone())
|
||||
.collect()
|
||||
}
|
||||
|
||||
pub fn table_count(&self) -> usize {
|
||||
self.tables.len()
|
||||
}
|
||||
}
|
||||
|
||||
pub trait QuestionEmbedder: Send + Sync {
|
||||
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>>;
|
||||
}
|
||||
|
||||
pub struct LocalEmbedder {
|
||||
model_path: String,
|
||||
}
|
||||
|
||||
impl LocalEmbedder {
|
||||
pub fn new(model_path: String) -> Self {
|
||||
Self { model_path }
|
||||
}
|
||||
}
|
||||
|
||||
impl QuestionEmbedder for LocalEmbedder {
|
||||
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>> {
|
||||
use std::process::Command;
|
||||
|
||||
let output = Command::new("python3")
|
||||
.args([
|
||||
"-c",
|
||||
&format!(
|
||||
r#"
|
||||
import json
|
||||
from sentence_transformers import SentenceTransformer
|
||||
model = SentenceTransformer('{}')
|
||||
emb = model.encode('{}', convert_to_numpy=True)
|
||||
print(json.dumps([float(x) for x in emb]))
|
||||
"#,
|
||||
self.model_path,
|
||||
text.replace("'", "\\'")
|
||||
),
|
||||
])
|
||||
.output()?;
|
||||
|
||||
if !output.status.success() {
|
||||
let err = String::from_utf8_lossy(&output.stderr);
|
||||
anyhow::bail!("Embedding generation failed: {}", err);
|
||||
}
|
||||
|
||||
let output_str = String::from_utf8_lossy(&output.stdout);
|
||||
let floats: Vec<f32> = serde_json::from_str(&output_str)?;
|
||||
|
||||
Ok(floats)
|
||||
}
|
||||
}
|
||||
|
||||
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
|
||||
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
|
||||
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
|
||||
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
|
||||
|
||||
if norm_a == 0.0 || norm_b == 0.0 {
|
||||
0.0
|
||||
} else {
|
||||
dot_product / (norm_a * norm_b)
|
||||
}
|
||||
}
|
||||
|
||||
pub fn select_tables_from_question(
|
||||
question: &str,
|
||||
embeddings_path: &str,
|
||||
threshold: f32,
|
||||
) -> anyhow::Result<Vec<String>> {
|
||||
let selector = TableSelector::new(embeddings_path, threshold)?;
|
||||
let embedder = LocalEmbedder::new("all-MiniLM-L6-v2".to_string());
|
||||
selector.select_tables(question, &embedder)
|
||||
}
|
||||
ask/system_prompt.md

@@ -147,3 +147,68 @@ LIMIT 30
if the question requires tables not in the provided DDL, OR
if you can't generate a valid SQL,
answer as a JSON {error: "#{reason}"}


## Common SQL Pitfalls & Debugging Strategy

### 1. Column Propagation in CTEs (Most Common Error!)
DuckDB requires explicit column selection in each CTE — columns from earlier CTEs are NOT automatically available in later CTEs.

WRONG — `pop_2010` was not selected in the `populacao` CTE:
```sql
WITH populacao AS (
  SELECT id_municipio, sigla_uf        -- forgot pop_2010
),
fluxo AS (
  SELECT p.pop_2010                    -- error: pop_2010 not in p
)
```

CORRECT — Select all columns needed in subsequent CTEs:
```sql
WITH populacao AS (
  SELECT id_municipio, sigla_uf, pop_2010, pop_2022   -- explicit
),
fluxo AS (
  SELECT p.pop_2010                    -- works
)
```

### 2. ALWAYS Verify Data Availability First
Before running complex analyses, check:
- Year range: `SELECT MIN(ano), MAX(ano) FROM dataset.table`
- Record count: `SELECT COUNT(*) FROM dataset.table`
- ID format compatibility between tables before JOIN

### 3. Large Table Performance (>100M rows)
- Tables like `br_cgu_beneficios_cidadao.novo_bolsa_familia` (588M+ records) WILL timeout
- Strategy: Aggregate first with WHERE filters, then join (see the sketch below)
- Use `LIMIT` when exploring to avoid long scans
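A minimal sketch of the aggregate-first strategy; the column names (`ano`, `sigla_uf`, `id_municipio`, `valor`) and the join to `br_bd_diretorios_brasil.municipio` are illustrative assumptions and should be checked against the provided DDL before use:

```sql
-- Aggregate the huge table with filters first, then join the small result.
WITH beneficios AS (
  SELECT id_municipio, SUM(valor) AS total_pago
  FROM br_cgu_beneficios_cidadao.novo_bolsa_familia
  WHERE ano = 2023 AND sigla_uf = 'SP'      -- filter BEFORE aggregating
  GROUP BY id_municipio
)
SELECT m.nome, b.total_pago
FROM beneficios b
JOIN br_bd_diretorios_brasil.municipio m USING (id_municipio)
ORDER BY b.total_pago DESC
LIMIT 20;
```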
### 4. Lock Conflicts
Multiple concurrent DuckDB queries on the same `.duckdb` file cause lock errors.
- Wait between queries or use read-only mode

### 5. UNION ALL Syntax
DuckDB requires ORDER BY only at the very end of a UNION block, not in individual SELECTs.

WRONG:
```sql
SELECT ... LIMIT 5
ORDER BY x
UNION ALL
SELECT ... LIMIT 5
ORDER BY y   -- error
```

CORRECT — Use subqueries or CTEs:
```sql
SELECT * FROM (SELECT ... ORDER BY x LIMIT 5) a
UNION ALL
SELECT * FROM (SELECT ... ORDER BY y LIMIT 5) b
```

### 6. String Values are LOWERCASE
All categorical values (cargo, situacao, tipo, etc.) are stored in lowercase.
Always use: `WHERE cargo = 'deputado federal'`, not `'DEPUTADO FEDERAL'`.
Binary file not shown.
context/basedosdados-schema.json | 2075 (new file)
File diff suppressed because one or more lines are too long

context/table_embeddings.json | 298355 (new file)
File diff suppressed because one or more lines are too long
data/basedosdados.duckdb | BIN (new file)
Binary file not shown.
docs/dataset_embeds.md | 59 (new file)
@@ -0,0 +1,59 @@
## Goal

Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.

## Instructions

- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
- Use a similarity threshold (default 0.35) instead of a fixed top-k to dynamically select tables (see the formula after this list)
- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs
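Written out, the threshold rule is a similarity cut rather than a fixed top-k: with $q$ the embedding of the question, $e_t$ the embedding of table $t$, and $\tau$ the threshold (0.35 by default), the selected set is

$$
\cos(q, e_t) = \frac{q \cdot e_t}{\lVert q \rVert\,\lVert e_t \rVert},
\qquad
S = \{\, t \mid \cos(q, e_t) \ge \tau \,\}
$$

so the number of tables sent to the LLM grows or shrinks with how specific the question is, which is the behavior the threshold-tuning note below describes.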
## Discoveries

- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
- **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
- **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL
- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
- **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

## Accomplished

1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

## Relevant files / directories

### Created/Modified
- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names

### Configuration (env vars)
- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
- `SIMILARITY_THRESHOLD` - 0.35 default
- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
- `EMBEDDINGS_FILE`, `SCHEMA_JSON`

## Next Steps

- Increase similarity threshold (try 0.45) to reduce table count
- Improve sqlcoder prompt for better SQL generation
- Add fallback to increase threshold if too many tables selected
- Consider keyword matching as backup if embeddings fail
docs/patterns-audit.md | 299 (new file)
@@ -0,0 +1,299 @@
# Pattern Audit — Robustness & False Positive Analysis

Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.

---

## US1 — Split Contracts Below Threshold (`split_contracts_below_threshold`)

### Legal basis
**Fracionamento de licitação** is prohibited by:
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.

### Threshold: year-dependent

| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |

For 2023 data many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$57.912. Using a static R$17.600 for 2024+ data would miss the main fraud window (R$17k–R$57k per contract). **Fixed (iteration 7):** all three implementations compute the threshold from the query year.
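To make the detection rule concrete, here is a minimal BigQuery-style sketch. It keeps the column names used throughout this audit (`id_orgao_superior`, `data_assinatura_contrato`, `valor_inicial_compra`), but the table name `contratos`, the supplier column `cnpj_contratada`, and the exact grouping are illustrative assumptions, not the production query:

```sql
-- Sketch: flag (ministry, supplier, month) groups with >= 3 sub-threshold contracts
-- whose combined value exceeds the year-dependent threshold.
WITH base AS (
  SELECT
    id_orgao_superior,
    cnpj_contratada,                                   -- assumed supplier column
    FORMAT_DATE('%Y-%m', data_assinatura_contrato) AS mes,
    valor_inicial_compra,
    IF(EXTRACT(YEAR FROM data_assinatura_contrato) >= 2024, 57912, 17600) AS threshold
  FROM contratos                                       -- stand-in for the real source table
  WHERE valor_inicial_compra > 0
    AND data_assinatura_contrato IS NOT NULL           -- avoid a spurious NULL-month bucket
)
SELECT
  id_orgao_superior, cnpj_contratada, mes,
  COUNT(*) AS n_contratos,
  SUM(valor_inicial_compra) AS valor_combinado
FROM base
WHERE valor_inicial_compra < threshold
GROUP BY id_orgao_superior, cnpj_contratada, mes
HAVING COUNT(*) >= 3 AND SUM(valor_inicial_compra) > MAX(threshold);
```

The `EXTRACT(YEAR ...)` branch is what iteration 7 refers to; the other guards mirror the ones discussed in the rest of this section.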
### False positive scenarios
1. **Legitimate multi-item purchasing**: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The `combined_value > threshold` guard reduces but doesn't eliminate this.
2. **Recurring service contracts**: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year — correctly NOT flagged (combined = R$18.000 > threshold, count ≥ 3 in first 3 months).
3. **Different sub-units**: The grouping uses `id_orgao_superior` (ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.

### Improvements applied
- None structural. Filter `valor_inicial_compra > 0` prevents division issues.

### Known data quality issues
- `data_assinatura_contrato` can be NULL for some older contracts. **`FORMAT_DATE` on NULL returns NULL — it does NOT exclude those rows.** Without a guard, all NULL-dated contracts from the same agency would be grouped together under a single `NULL` month bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now include `AND data_assinatura_contrato IS NOT NULL` in the WHERE clause.
- `valor_inicial_compra` vs `valor_final_compra`: we use `valor_inicial_compra` intentionally, since splitting is defined by the contract as signed, not as finalized.

### Improvements applied (iteration 5)
- Added `AND data_assinatura_contrato IS NOT NULL` to the WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spurious `mes = NULL` bucket.

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): `scan-all.ts` now includes `id_orgao_superior` in both SELECT and GROUP BY, matching `index.ts` and `scan-suspicious.ts`. Prevents theoretical merging of two distinct ministries sharing the same name.

---

## US2 — Contract Concentration (`contract_concentration`)

### Legal basis
No specific legal prohibition, but **TCU** and **CGU** audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.

### Thresholds
- **40% share**: empirical; above this, competition is functionally absent for that agency (see the sketch after this list).
- **R$ 50.000 minimum agency total**: excludes micro-units (small local offices) where one purchase naturally dominates.
- **R$ 10.000 minimum supplier spend** (new, iteration 2): excludes trivial cases like a company with R$21k of a R$50k agency = 42%, where both numbers are small.
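As a reference for how the three cutoffs interact, a simplified sketch; the `contratos` table and `cnpj_contratada` column are illustrative placeholders, and the real queries also carry `nome_orgao_superior` in the grouping, as noted in the consistency item below:

```sql
-- Sketch: each supplier's share of a ministry's total spend, with all three guards.
WITH spend AS (
  SELECT id_orgao_superior, cnpj_contratada,
         SUM(valor_inicial_compra) AS supplier_spend
  FROM contratos
  GROUP BY id_orgao_superior, cnpj_contratada
),
ministry_total AS (
  SELECT id_orgao_superior, SUM(supplier_spend) AS agency_total
  FROM spend
  GROUP BY id_orgao_superior
)
SELECT s.id_orgao_superior, s.cnpj_contratada,
       ROUND(s.supplier_spend / t.agency_total, 3) AS share
FROM spend s
JOIN ministry_total t USING (id_orgao_superior)
WHERE t.agency_total   >= 50000    -- minimum agency total
  AND s.supplier_spend >= 10000    -- CONCENTRATION_MIN_SUPPLIER_SPEND
  AND s.supplier_spend / t.agency_total > 0.40;
```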
### False positive scenarios
1. **Specialized niches**: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
2. **Monopolistic markets**: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
3. **Framework agreements**: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.

### Improvements applied
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000` to batch query and `scan-suspicious.ts` (iteration 2).
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND` filter to `index.ts` `patternConcentration` HAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): `index.ts` HAVING clause now includes `supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND`.
✅ Fixed (iteration 9): `scan-all.ts` and `scan-suspicious.ts` now group by `(id_orgao_superior, nome_orgao_superior)` in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.

---

## US3 — Inexigibility Recurrence (`inexigibility_recurrence`)

### Legal basis
**Inexigibilidade de licitação** (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
- Reference: **TCU Acórdão 1.793/2011**: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.

### Threshold: 3 contracts per managing unit
- Below 3: could be two legitimate sole-source needs in the same year.
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.

### False positive scenarios
1. **Legitimate exclusive suppliers**: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
2. **Long-term technical partnerships**: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
3. **Artistic/cultural organizations**: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.

### Improvements applied (iteration 2)
- **Batch + scan-suspicious**: Now groups by `id_unidade_gestora` (ID) + `nome_unidade_gestora` (name). Previously grouped by name only, risking merger of distinct units sharing a common name.
- **Batch + scan-suspicious**: Added `valor_inicial_compra >= R$ 1.000` filter. Micro-value contracts (< R$1k) rarely represent real abuse.

### Improvements applied (iteration 4)
- **`index.ts`**: Added `AND valor_inicial_compra >= @min_value` to WHERE clause of `patternInexigibility`. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): all three implementations now filter `valor_inicial_compra >= R$ 1.000` and group by `id_unidade_gestora`.

---

## US4 — Single Bidder (`single_bidder`)

### Legal basis

Not inherently illegal, but flagged by:

- **Open Contracting Partnership "73 Red Flags" (2024)**, Flag #1: "Only one bid received."
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.

### Threshold: 2 occurrences

- Intentionally low. Even one solo-bid win warrants investigation context. Two is the minimum pattern.

### False positive scenarios

1. **Specialized markets**: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
2. **Geographic isolation**: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
3. **Poorly timed notices**: Short bid windows or holiday periods reduce participation regardless of market structure.

### SQL robustness notes

- Per-CNPJ: uses `STARTS_WITH(REGEXP_REPLACE(...), @cnpj)` — this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface.
- Batch: uses `MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))` to extract the winner's CNPJ from the `auction_stats` CTE. The `LENGTH = 14` guard in the `IF` condition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows have `vencedor=true` for the same auction (data quality issue), `MAX` picks lexicographically last — acceptable for batch purposes.

### Per-CNPJ vs batch consistency

✅ Fixed (iteration 8): **batch now counts ALL participants** (CPF + CNPJ) for `total_bidders`, matching per-CNPJ behavior. Previously, `LENGTH = 14` excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The `LENGTH = 14` guard is now applied only inside the `winner_cnpj` extraction `IF()` condition — not to the overall participant count.
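
A compact sketch of that aggregation shape (the real `auction_stats` CTE is larger; `participantes` and its columns are stand-ins):

```typescript
// Hypothetical aggregate per auction: count every participant (CPF or CNPJ),
// but only extract an 8-digit winner key when the winner is a 14-digit CNPJ.
const auctionStats = `
  SELECT id_licitacao,
         COUNT(*) AS total_bidders,
         MAX(IF(vencedor AND LENGTH(doc) = 14, SUBSTR(doc, 1, 8), NULL)) AS winner_cnpj
  FROM (
    SELECT id_licitacao, vencedor,
           REGEXP_REPLACE(cpf_cnpj_participante, r'\\D', '') AS doc
    FROM participantes
  )
  GROUP BY id_licitacao
`;
```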

---

## US5 — Always Winner (`always_winner`)

### Legal basis

Not illegal per se, but high win rates in competitive auctions indicate possible:

- Bid rigging (Lei 12.529/2011 art. 36, IV)
- Tailored specifications (Lei 14.133/2021 art. 9, I)
- Reference: **OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)**

### Thresholds

- **≥80% win rate** (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
- **≥10 competitive participations** — minimum sample for statistical significance. Aligns batch and per-CNPJ.
- **Competitive auctions only (≥2 bidders)** — critical to avoid overlap with US4.

### Critical fix applied (iteration 2)

**The per-CNPJ version was NOT filtering for competitive auctions before this iteration.** A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a `competitive_auctions` CTE that filters `COUNT(1) >= 2`.
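
A minimal sketch of that fix; the surrounding query and the `participantes` table name are assumed, only the `COUNT(1) >= 2` idea comes from the spec:

```typescript
// Hypothetical shape: restrict the win-rate calculation to auctions that had
// at least two bidders, so solo-bid wins are handled by US4 only.
const competitiveWinRate = `
  WITH competitive_auctions AS (
    SELECT id_licitacao
    FROM participantes
    GROUP BY id_licitacao
    HAVING COUNT(1) >= 2
  )
  SELECT p.cnpj_basico,
         COUNT(*) AS participations,
         COUNTIF(p.vencedor) AS wins,
         SAFE_DIVIDE(COUNTIF(p.vencedor), COUNT(*)) AS win_rate
  FROM participantes p
  JOIN competitive_auctions c USING (id_licitacao)
  GROUP BY p.cnpj_basico
  HAVING COUNT(*) >= 10 AND SAFE_DIVIDE(COUNTIF(p.vencedor), COUNT(*)) >= 0.80
`;
```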

### Win rate distribution note

The `licitacao_participante` dataset is **strongly bimodal**: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only **perfect-win companies** — intentionally strict. This is documented in the spec.
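
For reference, a dynamic Q3 cutoff of this kind can be computed in BigQuery with `APPROX_QUANTILES`; the snippet is a sketch over an assumed `win_rates` CTE, not the actual batch query:

```typescript
// Hypothetical: third quartile of per-company win rates among companies with
// >= 10 competitive participations. With the bimodal distribution described
// above, this evaluates to roughly 1.0.
const q3Threshold = `
  SELECT APPROX_QUANTILES(win_rate, 4)[OFFSET(3)] AS q3
  FROM win_rates
  WHERE participations >= 10
`;
```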

### Per-CNPJ vs batch consistency

✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.

---

## US6 — Amendment Inflation (`amendment_inflation`)

### Legal basis

**Lei 14.133/2021 art. 125 §1º**: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract **reached or exceeded its legal ceiling**.

### Threshold: 1.25× (25% above original)

- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).

### False positive scenarios

1. **Lawful exceptional amendments**: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
2. **Construction contracts**: Legal ceiling is 50% (not 25%). Our threshold of 1.25× flags construction contracts that are within the legal limit.
3. **Value adjustment clauses**: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
4. **Data entry errors**: Some `valor_final_compra` values are clearly data quality issues (e.g., 100× original).

### Improvements applied (iteration 3)

- **Cap `inflation_ratio` at 10×** (`AMENDMENT_MAX_INFLATION_RATIO = 10.0`): ratios above this threshold are almost certainly data entry errors (e.g., `valor_final_compra` entered in a different unit) and would distort `total_excess` reporting. Applied to all three implementations via `AND ... <= @max_ratio` filter in SQL. Applied in `index.ts`, `scan-all.ts`, `scan-suspicious.ts`.

### Schema verification: construction vs goods/services threshold

Lei 14.133/2021 art. 125 §1º allows 50% amendments for engineering works vs 25% for goods/services.

**Column verified (schema dump):** `contrato_compra` has `id_modalidade_licitacao` (code) and `modalidade_licitacao` (name). However, this column encodes **bidding modality** (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no `tipo_contrato` or `categoria` column in the accessible schema.

### Improvements applied (iteration 8): construction keyword detection

All three implementations now apply `IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25)` to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.

**Keywords and rationale:**

| Keyword | Matches | Rationale |
|---------|---------|-----------|
| `obra` | obra, obras | General construction work |
| `constru` | construção, construir | Building/construction |
| `reform` | reforma, reformar, reformas | Renovation/remodeling |
| `engenhari` | engenharia, engenheiro | Engineering services |
| `paviment` | pavimentação, pavimento | Road/floor paving |
| `demoli` | demolição, demolir | Demolition |

**Known limitations:** The `objeto` field is free text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (a conservative false positive versus a missed construction exemption).
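
Putting the iteration 3 cap and the iteration 8 keyword detection together, the per-contract check has roughly this shape (a sketch only; the `contratos` table and alias names are assumptions):

```typescript
// Hypothetical per-contract filter: pick the legal ceiling from the free-text
// "objeto", then keep contracts at or above that ceiling, discarding ratios
// above 10x as likely data entry errors.
const AMENDMENT_MAX_INFLATION_RATIO = 10.0;

const amendmentFilter = `
  SELECT * FROM (
    SELECT *,
           SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) AS inflation_ratio,
           IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                              r'obra|constru|reform|engenhari|paviment|demoli'),
              1.50, 1.25) AS legal_ceiling
    FROM contratos
  )
  WHERE inflation_ratio >= legal_ceiling
    AND inflation_ratio <= ${AMENDMENT_MAX_INFLATION_RATIO}
`;
```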

### Improvements applied (iteration 9): constructionCount field

`AmendmentInflationFlag` now includes `constructionCount`: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."

### Per-CNPJ vs batch consistency

⚠️ Minor divergence (accepted): `index.ts` includes the aditivos CTE (`zeroAmendmentCount`) and `constructionCount` from `is_construction`. The batch scanners do NOT include these: a full scan of `contrato_termo_aditivo` is too expensive in batch, and `constructionCount` is per-row information that cannot be aggregated without the row-level data. Both fields are only available in the web UI's per-CNPJ output.

---

## US7 — Newborn Company (`newborn_company`)

### Legal basis

No specific prohibition, but:

- **Lei 14.133/2021 art. 68, I**: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of a possible shell company.

### Thresholds

- **180 days** (6 months): practical minimum for legitimate operational readiness.
- **R$ 50.000 minimum contract value**: excludes training contracts and small acquisitions where new companies are common and low-risk (see the sketch below).
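
A minimal sketch of the age test these thresholds imply. Table and column names (`estabelecimentos`, `contratos`, `data_assinatura`) are illustrative stand-ins for the real BigQuery tables, not the exact batch SQL:

```typescript
// Hypothetical newborn check: earliest establishment opening date per
// cnpj_basico, compared with the signing date of the company's first
// contract of at least R$ 50.000.
const NEWBORN_MAX_AGE_DAYS = 180;
const NEWBORN_MIN_VALUE = 50_000;

const newbornQuery = `
  WITH founding AS (
    SELECT cnpj_basico, MIN(data_inicio_atividade) AS fundacao
    FROM estabelecimentos
    WHERE ano = 2023 AND mes = 12
    GROUP BY cnpj_basico
  ),
  first_contract AS (
    SELECT cnpj_basico, MIN(data_assinatura) AS primeiro_contrato
    FROM contratos
    WHERE valor_final_compra >= ${NEWBORN_MIN_VALUE}
    GROUP BY cnpj_basico
  )
  SELECT f.cnpj_basico, f.fundacao, c.primeiro_contrato,
         DATE_DIFF(c.primeiro_contrato, f.fundacao, DAY) AS idade_dias
  FROM first_contract c
  JOIN founding f USING (cnpj_basico)
  WHERE DATE_DIFF(c.primeiro_contrato, f.fundacao, DAY)
        BETWEEN 0 AND ${NEWBORN_MAX_AGE_DAYS}
`;
```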

### False positive scenarios

1. **Spinoffs and restructurings**: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
2. **Holding company structures**: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
3. **Startups in innovation programs**: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
4. **`data_inicio_atividade` from establishments**: The founding date comes from `br_me_cnpj.estabelecimentos`, not `empresas`. Branches opened after the headquarters can make an established company appear "newborn" in a specific municipality.

### Data quality note

`data_inicio_atividade` lives in `br_me_cnpj.estabelecimentos`, NOT `empresas`. The query uses `MIN(est.data_inicio_atividade)` across all establishments for the same `cnpj_basico` — this correctly picks the earliest known opening date, reducing the branch-related false positives.

### Per-CNPJ vs batch consistency

✅ Equivalent. Both use `MIN(data_inicio_atividade)` across establishments with `ano=2023 AND mes=12`.

⚠️ **Known necessary full-table scan**: The `first_contract` CTE in `batchNewborn` (`scan-all.ts`) intentionally omits an `ano` filter on `contrato_compra`:

```sql
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
  AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
```

This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: *"did this company win its very first contract within 180 days of founding?"* Restricting to `ano = ANO` would miss the true first contract if it occurred in an earlier year — producing a false negative. The `founding` CTE correctly filters `e.ano = ANO AND est.ano = ANO AND est.mes = 12`. Only `first_contract` scans all years, but the `LENGTH = 14` CPF exclusion and the `valor_final_compra >= R$ 50k` filter significantly reduce bytes scanned.

---

## US8 — Sudden Surge (`sudden_surge`)

### Legal basis

Not illegal, but flagged by:

- **UNODC "Guidebook on anti-corruption in public procurement" (2013)**: "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.

### Thresholds

- **5× YoY growth**: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- **R$ 1.000.000 minimum**: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- **4-year lookback**: captures context before the surge.

### False positive scenarios

1. **Post-restructuring recovery**: A company that was inactive for 2 years then resumed full operations would appear to surge.
2. **New framework agreements**: Being added to a large framework agreement in year N can produce an apparent surge with no underlying change in the company.
3. **Government budget cycles**: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements), creating apparent surges.

### SQL robustness note

Both per-CNPJ and batch use a `prev_v > 0` guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses the `LAG` window function; per-CNPJ iterates over the history array client-side.

**Consecutive-year guard (iteration 6):** The spec says `value[year_N] / value[year_N-1]`. Without a guard, `LAG` compares any adjacent rows in sorted order — if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:

- `scan-all.ts`: added `LAG(ano)` alongside `LAG(v)` and `WHERE ano - prev_ano = 1`
- `index.ts`, `scan-suspicious.ts`: added `curr.ano - prev.ano === 1` to the JS loop condition (see the sketch below)
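
Both variants of the guard fit in a few lines; the snippet below is a sketch with assumed shapes for the yearly-history rows, not the literal code in `scan-all.ts` or `index.ts`:

```typescript
// Batch (SQL): carry the previous year alongside the previous value, and only
// compare strictly consecutive years.
const surgeLag = `
  SELECT cnpj_basico, ano, v,
         LAG(v)   OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_v,
         LAG(ano) OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_ano
  FROM yearly_totals
  QUALIFY prev_v > 0 AND ano - prev_ano = 1 AND v / prev_v >= 5 AND v >= 1000000
`;

// Per-CNPJ (client side): the same rule over a sorted history array.
interface YearTotal { ano: number; v: number; }

function firstSurge(history: YearTotal[]): YearTotal | null {
  const sorted = [...history].sort((a, b) => a.ano - b.ano);
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    if (prev.v > 0 && curr.ano - prev.ano === 1 &&
        curr.v / prev.v >= 5 && curr.v >= 1_000_000) {
      return curr; // report only the first qualifying surge year
    }
  }
  return null;
}
```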

**Effect on false positive #1:** The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger, since the consecutive-year guard skips companies that were dormant for a year or more.

The per-CNPJ implementation reports only the **first** qualifying surge year (it breaks on the first hit). If a company surged twice, only the earlier event is shown. This is conservative.

### Per-CNPJ vs batch consistency

✅ Equivalent. Batch uses SQL `LAG`; per-CNPJ uses a JS loop. Both find the first qualifying year.

---

## Infrastructure Issue: Cache Miss vs Stored Null

### Bug 1: Cache Miss vs Stored Null (fixed iteration 6)

`cache.ts` `getCache` was returning `null` for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use `null` as their "nothing found" sentinel and check `cached !== undefined` to skip re-querying. With the old `getCache` returning `null` on miss, `null !== undefined` evaluated to `true`, causing the BigQuery query to be skipped permanently — US4–US8 would never execute on a CNPJ not yet in cache.

**Fix:** `getCache` now returns `undefined` on miss or expiry; returns `T` (including `null`) on a valid cache hit. The company-lookup caller that used `!== null` was updated to `!== undefined`.
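
A sketch of the corrected contract (a file-per-key JSON cache is assumed; the real `cache.ts` may differ in storage details):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical cache entry layout: stored value plus an expiry timestamp.
interface CacheEntry<T> { value: T; expiresAt: number; }

// Returns undefined on miss or expiry; returns the stored value (which may
// legitimately be null) on a hit, so callers can distinguish the two cases.
function getCache<T>(path: string): T | undefined {
  if (!existsSync(path)) return undefined;
  const entry = JSON.parse(readFileSync(path, "utf8")) as CacheEntry<T>;
  if (Date.now() > entry.expiresAt) return undefined;
  return entry.value;
}
```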

### Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)

US1, US2, US3, and `runPatterns()` in `index.ts` used `if (cached) return cached` to check for cache hits. A plain truthiness check cannot distinguish a cache miss (`undefined`) from a legitimately cached falsy "nothing found" value, so a cached clean result could be silently discarded, causing BigQuery to be re-queried on every subsequent call for clean CNPJs.

Affected: `patternSplitContracts`, `patternConcentration`, `patternInexigibility`, `runPatterns`.

**Fix:** changed all four to `if (cached !== undefined) return cached`. (US4–US8 already used this pattern since they cache `null` as "nothing found" — they were correct.)
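
A before/after sketch of the corrected check; the helpers and the `Flag` type are stubs standing in for the real code:

```typescript
type Flag = Record<string, unknown>;

// Stubs for the real cache and BigQuery helpers (assumptions, not the real API).
declare function getCache<T>(path: string): T | undefined;
declare function setCache<T>(path: string, value: T): void;
declare function queryBigQuery(cnpj: string): Promise<Flag[]>;

// Before: `if (cached) return cached` also fell through on cached falsy values.
// After (iteration 7): only an actual cache miss reaches BigQuery; a cached
// empty array still counts as a hit.
async function patternConcentration(cnpj: string): Promise<Flag[]> {
  const cached = getCache<Flag[]>(`cache/concentration-${cnpj}.json`);
  if (cached !== undefined) return cached;
  const flags = await queryBigQuery(cnpj);
  setCache(`cache/concentration-${cnpj}.json`, flags);
  return flags;
}
```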

---

## Cross-Pattern Issues

### Overlap between US4 and US5

- **Before iteration 2**: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- **After iteration 2**: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.

### Overlap between US7 and US8

- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive — both signals reinforce each other.

### CNPJ matching strategy

All patterns use `cnpj_basico` (8-digit root) as the joining key. This means **all branches and subsidiaries** of a corporate group are attributed to the same `cnpj_basico`. This can create false positives for large corporations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.
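
For reference, `cnpj_basico` is simply the first 8 digits of the full 14-digit CNPJ; a sketch of the normalization used as the join key (the function name is illustrative):

```typescript
// Strip formatting and keep the 8-digit corporate root shared by all branches.
// e.g. "12.345.678/0001-00" -> "12345678" (a branch "12.345.678/0002-00" maps
// to the same root). Returns null for CPFs and malformed values.
function cnpjBasico(cnpj: string): string | null {
  const digits = cnpj.replace(/\D/g, "");
  return digits.length === 14 ? digits.slice(0, 8) : null;
}
```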

---

## Summary Table

| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---------|---------|------------|---------------|
| US1 Split | Medium — multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; **batch GROUP BY now includes id_orgao_superior** |
| US2 Concentration | Medium — specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; **falsy cache check fixed**; **all 3 now GROUP BY (id+name) — no ministry-name collision** |
| US3 Inexigibility | High — legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; **falsy cache check fixed** |
| US4 Single Bidder | Medium — specialized/remote markets | OCP 2024 Flag #1 | **cache.ts bug fixed** (getCache null-vs-undefined); **batch now counts all participants (CPF+CNPJ)** — consistent with per-CNPJ |
| US5 Always Winner | **Was HIGH** (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; **cache.ts bug fixed** |
| US6 Amendment | Medium — inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; **cache.ts bug fixed**; **construction keyword detection: 1.50× threshold for obras/etc.**; **constructionCount in UI flag** |
| US7 Newborn | High — spinoffs, restructurings | CGU 2021 guide | **cache.ts bug fixed** (was never querying BigQuery on cache miss) |
| US8 Surge | Medium — framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; **cache.ts bug fixed** |

BIN
docs/wordcloud_attributes.png
Normal file
BIN
docs/wordcloud_attributes.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 1.8 MiB |
45
docs/wordcloud_attributes.py
Normal file
45
docs/wordcloud_attributes.py
Normal file
@@ -0,0 +1,45 @@
|
||||
#!/usr/bin/env python3
|
||||
import json
|
||||
import re
|
||||
from collections import Counter
|
||||
from wordcloud import WordCloud
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
STOPWORDS = {'de', 'do', 'da', 'a', 'ou', 'em', 'e', 'o', 'que', 'das', 'dos', 'nos', 'nas', 'um', 'uma', 'para', 'com', 'não', 'à', 'ao', 'os', 'as', 'se', 'na', 'no', 'é', 'ser', 'seu', 'sua', 'isso', 'the', 'of', 'and', 'in', 'to', 'is', 'for', 'on', 'with', 'at', 'by', 'from'}
|
||||
|
||||
with open('context/basedosdados-schema.json') as f:
|
||||
schema = json.load(f)
|
||||
|
||||
words = []
|
||||
for dataset, tables in schema.items():
|
||||
for table, cols in tables.items():
|
||||
for col in cols:
|
||||
name = col.get('name', '').lower()
|
||||
desc = col.get('description', '').lower()
|
||||
if name and len(name) >= 3:
|
||||
words.append(name)
|
||||
if desc:
|
||||
for w in desc.split():
|
||||
w = re.sub(r'[^a-záàâãéèêíìîóòôõúùûç]', '', w)
|
||||
if len(w) >= 3 and w not in STOPWORDS:
|
||||
words.append(w)
|
||||
|
||||
word_freq = Counter(words)
|
||||
|
||||
wc = WordCloud(
|
||||
width=1600,
|
||||
height=800,
|
||||
background_color='white',
|
||||
max_words=200,
|
||||
colormap='viridis',
|
||||
min_font_size=8
|
||||
).generate_from_frequencies(word_freq)
|
||||
|
||||
plt.figure(figsize=(20, 10))
|
||||
plt.imshow(wc, interpolation='bilinear')
|
||||
plt.axis('off')
|
||||
plt.tight_layout(pad=0)
|
||||
plt.savefig('docs/wordcloud_attributes.png', dpi=150, bbox_inches='tight')
|
||||
print("Saved docs/wordcloud_attributes.png")
|
||||
print(f"Total unique words: {len(word_freq)}")
|
||||
print("Top 30:", word_freq.most_common(30))
|
||||
BIN
docs/wordcloud_datasets.png
Normal file
BIN
docs/wordcloud_datasets.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 1.3 MiB |
33
docs/wordcloud_datasets.py
Normal file
33
docs/wordcloud_datasets.py
Normal file
@@ -0,0 +1,33 @@
|
||||
#!/usr/bin/env python3
|
||||
import json
|
||||
from collections import Counter
|
||||
from wordcloud import WordCloud
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
with open('context/basedosdados-schema.json') as f:
|
||||
schema = json.load(f)
|
||||
|
||||
dataset_names = []
|
||||
for dataset in schema.keys():
|
||||
parts = dataset.replace('br_', '').replace('mundo_', '').replace('eu_', '').split('_')
|
||||
dataset_names.extend([p for p in parts if len(p) >= 3])
|
||||
|
||||
word_freq = Counter(dataset_names)
|
||||
|
||||
wc = WordCloud(
|
||||
width=1600,
|
||||
height=800,
|
||||
background_color='white',
|
||||
max_words=100,
|
||||
colormap='plasma',
|
||||
min_font_size=10
|
||||
).generate_from_frequencies(word_freq)
|
||||
|
||||
plt.figure(figsize=(20, 10))
|
||||
plt.imshow(wc, interpolation='bilinear')
|
||||
plt.axis('off')
|
||||
plt.tight_layout(pad=0)
|
||||
plt.savefig('docs/wordcloud_datasets.png', dpi=150, bbox_inches='tight')
|
||||
print("Saved docs/wordcloud_datasets.png")
|
||||
print(f"Total unique words: {len(word_freq)}")
|
||||
print("Top 30:", word_freq.most_common(30))
|
||||
268
gera_schemas.py
268
gera_schemas.py
@@ -1,268 +0,0 @@
|
||||
import os
|
||||
import json
|
||||
import sys
|
||||
import pyarrow.parquet as pq
|
||||
import s3fs
|
||||
import boto3
|
||||
import duckdb
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
S3_ENDPOINT = os.environ["HETZNER_S3_ENDPOINT"]
|
||||
S3_BUCKET = os.environ["HETZNER_S3_BUCKET"]
|
||||
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
|
||||
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
|
||||
|
||||
s3_host = S3_ENDPOINT.removeprefix("https://").removeprefix("http://")
|
||||
|
||||
# --- boto3 client (listing only, zero egress) ---
|
||||
boto = boto3.client(
|
||||
"s3",
|
||||
endpoint_url=S3_ENDPOINT,
|
||||
aws_access_key_id=ACCESS_KEY,
|
||||
aws_secret_access_key=SECRET_KEY,
|
||||
)
|
||||
|
||||
# --- s3fs filesystem (footer-only reads via pyarrow) ---
|
||||
fs = s3fs.S3FileSystem(
|
||||
client_kwargs={"endpoint_url": S3_ENDPOINT},
|
||||
key=ACCESS_KEY,
|
||||
secret=SECRET_KEY,
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 1: File inventory via S3 List API (zero data egress)
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 1: listing S3 objects...")
|
||||
paginator = boto.get_paginator("list_objects_v2")
|
||||
|
||||
inventory = {} # "dataset/table" -> {files: [...], total_size: int}
|
||||
|
||||
for page in paginator.paginate(Bucket=S3_BUCKET):
|
||||
for obj in page.get("Contents", []):
|
||||
key = obj["Key"]
|
||||
if not key.endswith(".parquet"):
|
||||
continue
|
||||
parts = key.split("/")
|
||||
if len(parts) < 3:
|
||||
continue
|
||||
dataset, table = parts[0], parts[1]
|
||||
dt = f"{dataset}/{table}"
|
||||
if dt not in inventory:
|
||||
inventory[dt] = {"files": [], "total_size_bytes": 0}
|
||||
inventory[dt]["files"].append(key)
|
||||
inventory[dt]["total_size_bytes"] += obj["Size"]
|
||||
|
||||
print(f" Found {len(inventory)} tables across {S3_BUCKET}")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 2: Schema reads — footer only (~30 KB per table)
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 2: reading parquet footers...")
|
||||
|
||||
def fmt_size(b):
|
||||
for unit in ("B", "KB", "MB", "GB", "TB"):
|
||||
if b < 1024 or unit == "TB":
|
||||
return f"{b:.1f} {unit}"
|
||||
b /= 1024
|
||||
|
||||
def extract_col_descriptions(schema):
|
||||
"""Try to pull per-column descriptions from Arrow metadata."""
|
||||
descriptions = {}
|
||||
meta = schema.metadata or {}
|
||||
# BigQuery exports embed a JSON blob under b'pandas' with column_info
|
||||
pandas_meta_raw = meta.get(b"pandas") or meta.get(b"pandas_metadata")
|
||||
if pandas_meta_raw:
|
||||
try:
|
||||
pm = json.loads(pandas_meta_raw)
|
||||
for col in pm.get("columns", []):
|
||||
name = col.get("name")
|
||||
desc = col.get("metadata", {}) or {}
|
||||
if isinstance(desc, dict) and "description" in desc:
|
||||
descriptions[name] = desc["description"]
|
||||
except Exception:
|
||||
pass
|
||||
# Also try top-level b'description' or b'schema'
|
||||
for key in (b"description", b"schema", b"BigQuery:description"):
|
||||
val = meta.get(key)
|
||||
if val:
|
||||
try:
|
||||
descriptions["__table__"] = val.decode("utf-8", errors="replace")
|
||||
except Exception:
|
||||
pass
|
||||
return descriptions
|
||||
|
||||
schemas = {}
|
||||
errors = []
|
||||
|
||||
for i, (dt, info) in enumerate(sorted(inventory.items())):
|
||||
dataset, table = dt.split("/", 1)
|
||||
first_file = info["files"][0]
|
||||
s3_path = f"{S3_BUCKET}/{first_file}"
|
||||
try:
|
||||
schema = pq.read_schema(fs.open(s3_path))
|
||||
col_descs = extract_col_descriptions(schema)
|
||||
|
||||
# Build raw metadata dict (decode bytes keys/values)
|
||||
raw_meta = {}
|
||||
if schema.metadata:
|
||||
for k, v in schema.metadata.items():
|
||||
try:
|
||||
dk = k.decode("utf-8", errors="replace")
|
||||
dv = v.decode("utf-8", errors="replace")
|
||||
# Try to parse JSON values
|
||||
try:
|
||||
dv = json.loads(dv)
|
||||
except Exception:
|
||||
pass
|
||||
raw_meta[dk] = dv
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
columns = []
|
||||
for field in schema:
|
||||
col = {
|
||||
"name": field.name,
|
||||
"type": str(field.type),
|
||||
"nullable": field.nullable,
|
||||
}
|
||||
if field.name in col_descs:
|
||||
col["description"] = col_descs[field.name]
|
||||
# Check field-level metadata
|
||||
if field.metadata:
|
||||
for k, v in field.metadata.items():
|
||||
try:
|
||||
dk = k.decode("utf-8", errors="replace")
|
||||
dv = v.decode("utf-8", errors="replace")
|
||||
if dk in ("description", "DESCRIPTION", "comment"):
|
||||
col["description"] = dv
|
||||
except Exception:
|
||||
pass
|
||||
columns.append(col)
|
||||
|
||||
schemas[f"{dataset}.{table}"] = {
|
||||
"path": f"s3://{S3_BUCKET}/{dataset}/{table}/",
|
||||
"file_count": len(info["files"]),
|
||||
"total_size_bytes": info["total_size_bytes"],
|
||||
"total_size_human": fmt_size(info["total_size_bytes"]),
|
||||
"columns": columns,
|
||||
"metadata": raw_meta,
|
||||
}
|
||||
print(f" [{i+1}/{len(inventory)}] ✓ {dataset}.{table} ({len(columns)} cols, {fmt_size(info['total_size_bytes'])})")
|
||||
except Exception as e:
|
||||
errors.append({"table": f"{dataset}.{table}", "error": str(e)})
|
||||
print(f" [{i+1}/{len(inventory)}] ✗ {dataset}.{table}: {e}", file=sys.stderr)
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 3: Enrich from br_bd_metadados.bigquery_tables (small table)
|
||||
# ------------------------------------------------------------------ #
|
||||
META_TABLE = "br_bd_metadados.bigquery_tables"
|
||||
meta_dt = "br_bd_metadados/bigquery_tables"
|
||||
|
||||
if meta_dt in inventory:
|
||||
print(f"Phase 3: enriching from {META_TABLE}...")
|
||||
try:
|
||||
con = duckdb.connect()
|
||||
con.execute("INSTALL httpfs; LOAD httpfs;")
|
||||
con.execute(f"""
|
||||
SET s3_endpoint='{s3_host}';
|
||||
SET s3_access_key_id='{ACCESS_KEY}';
|
||||
SET s3_secret_access_key='{SECRET_KEY}';
|
||||
SET s3_url_style='path';
|
||||
""")
|
||||
meta_path = f"s3://{S3_BUCKET}/br_bd_metadados/bigquery_tables/*.parquet"
|
||||
# Peek at available columns
|
||||
available = [r[0] for r in con.execute(f"DESCRIBE SELECT * FROM '{meta_path}' LIMIT 1").fetchall()]
|
||||
print(f" Metadata columns: {available}")
|
||||
|
||||
# Try to find dataset/table description columns
|
||||
desc_col = next((c for c in available if "description" in c.lower()), None)
|
||||
ds_col = next((c for c in available if c.lower() in ("dataset_id", "dataset", "schema_name")), None)
|
||||
tbl_col = next((c for c in available if c.lower() in ("table_id", "table_name", "table")), None)
|
||||
|
||||
if desc_col and ds_col and tbl_col:
|
||||
rows = con.execute(f"""
|
||||
SELECT {ds_col}, {tbl_col}, {desc_col}
|
||||
FROM '{meta_path}'
|
||||
""").fetchall()
|
||||
for ds, tbl, desc in rows:
|
||||
key = f"{ds}.{tbl}"
|
||||
if key in schemas and desc:
|
||||
schemas[key]["table_description"] = desc
|
||||
print(f" Enriched {len(rows)} table descriptions")
|
||||
else:
|
||||
print(f" Could not find expected columns (dataset_id, table_id, description) — skipping enrichment")
|
||||
con.close()
|
||||
except Exception as e:
|
||||
print(f" Enrichment failed: {e}", file=sys.stderr)
|
||||
else:
|
||||
print("Phase 3: br_bd_metadados.bigquery_tables not in S3 — skipping enrichment")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 4a: Write schemas.json
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 4: writing outputs...")
|
||||
|
||||
output = {
|
||||
"_meta": {
|
||||
"bucket": S3_BUCKET,
|
||||
"total_tables": len(schemas),
|
||||
"total_size_bytes": sum(v["total_size_bytes"] for v in schemas.values()),
|
||||
"total_size_human": fmt_size(sum(v["total_size_bytes"] for v in schemas.values())),
|
||||
"errors": errors,
|
||||
},
|
||||
"tables": dict(sorted(schemas.items())),
|
||||
}
|
||||
|
||||
with open("schemas.json", "w", encoding="utf-8") as f:
|
||||
json.dump(output, f, ensure_ascii=False, indent=2)
|
||||
|
||||
print(f" ✓ schemas.json ({len(schemas)} tables)")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 4b: Write file_tree.md
|
||||
# ------------------------------------------------------------------ #
|
||||
lines = [
|
||||
f"# S3 File Tree: {S3_BUCKET}",
|
||||
"",
|
||||
]
|
||||
|
||||
# Group by dataset
|
||||
datasets_map = {}
|
||||
for dt_key, info in sorted(inventory.items()):
|
||||
dataset, table = dt_key.split("/", 1)
|
||||
datasets_map.setdefault(dataset, []).append((table, info))
|
||||
|
||||
total_files = sum(len(v["files"]) for v in inventory.values())
|
||||
total_bytes = sum(v["total_size_bytes"] for v in inventory.values())
|
||||
|
||||
for dataset, tables in sorted(datasets_map.items()):
|
||||
ds_bytes = sum(i["total_size_bytes"] for _, i in tables)
|
||||
ds_files = sum(len(i["files"]) for _, i in tables)
|
||||
lines.append(f"## {dataset}/ ({len(tables)} tables, {fmt_size(ds_bytes)}, {ds_files} files)")
|
||||
lines.append("")
|
||||
for table, info in sorted(tables):
|
||||
schema_entry = schemas.get(f"{dataset}.{table}", {})
|
||||
ncols = len(schema_entry.get("columns", []))
|
||||
col_str = f", {ncols} cols" if ncols else ""
|
||||
table_desc = schema_entry.get("table_description", "")
|
||||
desc_str = f" — {table_desc}" if table_desc else ""
|
||||
lines.append(f" - **{table}/** ({len(info['files'])} files, {fmt_size(info['total_size_bytes'])}{col_str}){desc_str}")
|
||||
lines.append("")
|
||||
|
||||
lines += [
|
||||
"---",
|
||||
f"**Total: {len(inventory)} tables · {fmt_size(total_bytes)} · {total_files} parquet files**",
|
||||
]
|
||||
|
||||
with open("file_tree.md", "w", encoding="utf-8") as f:
|
||||
f.write("\n".join(lines) + "\n")
|
||||
|
||||
print(f" ✓ file_tree.md ({len(inventory)} tables)")
|
||||
print()
|
||||
print("Done!")
|
||||
print(f" schemas.json — full column-level schema dump")
|
||||
print(f" file_tree.md — bucket tree with sizes")
|
||||
if errors:
|
||||
print(f" {len(errors)} tables failed (see schemas.json _meta.errors)")
|
||||
@@ -1,4 +0,0 @@
|
||||
duckdb
|
||||
boto3
|
||||
python-dotenv
|
||||
openai
|
||||
42
scripts/build_ask.sh
Executable file
42
scripts/build_ask.sh
Executable file
@@ -0,0 +1,42 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
cd "$(dirname "$0")"
|
||||
|
||||
echo "=== Building ask binary for Linux x86_64 ==="
|
||||
echo "Using Debian x86_64 container for native build..."
|
||||
|
||||
# Build in an x86_64 Debian container - this gives us a real x86_64 environment
|
||||
# so we can build natively without cross-compilation complexity
|
||||
# Use ask/ as context to avoid .dockerignore excluding src/
|
||||
docker build \
|
||||
--platform linux/amd64 \
|
||||
-t ask-builder \
|
||||
--build-arg BUILDKIT_INLINE_CACHE=1 \
|
||||
-f - ask/ <<'EOF'
|
||||
FROM rust:1.85-slim
|
||||
|
||||
RUN apt-get update -qq && \
|
||||
apt-get install -y --no-install-recommends \
|
||||
build-essential pkg-config libssl-dev && \
|
||||
apt-get clean && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
WORKDIR /build
|
||||
|
||||
COPY . ./
|
||||
RUN cargo build --release --locked
|
||||
|
||||
FROM scratch
|
||||
COPY --from=0 /build/target/release/ask /ask
|
||||
EOF
|
||||
|
||||
echo "=== Extracting binary ==="
|
||||
# Extract the binary from the image. The final stage is FROM scratch (no shell,
# no cat), so create a container and copy the file out instead of running it.
mkdir -p ./ask/target/release
container_id=$(docker create --platform linux/amd64 ask-builder /ask)
docker cp "$container_id":/ask ./ask/target/release/ask
docker rm "$container_id" >/dev/null
|
||||
|
||||
# Make it executable
|
||||
chmod +x ./ask/target/release/ask
|
||||
|
||||
echo "=== Binary built successfully ==="
|
||||
file ./ask/target/release/ask
|
||||
ls -lh ./ask/target/release/ask
|
||||
@@ -62,7 +62,8 @@ if $GCLOUD_RUN; then
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
else
|
||||
elif ! $SYNC_RUN; then
|
||||
# Only require heavy GCP tools for the main export (not for --sync)
|
||||
for cmd in bq gcloud gsutil parallel rclone flock; do
|
||||
if ! command -v "$cmd" &>/dev/null; then
|
||||
log_err "'$cmd' not found. Install google-cloud-sdk, GNU parallel, and rclone."
|
||||
@@ -164,8 +165,8 @@ echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.clou
|
||||
| sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list >/dev/null
|
||||
sudo apt-get update -qq
|
||||
sudo apt-get install -y google-cloud-cli
|
||||
chmod +x ~/roda.sh
|
||||
echo "Dependencies installed."
|
||||
chmod +x ~/roda.sh
|
||||
echo "Dependencies installed."
|
||||
REMOTE_SETUP
|
||||
log " Dependencies ready."
|
||||
|
||||
@@ -197,6 +198,121 @@ REMOTE_SETUP
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# VM EXPORT — use existing bd-export-vm to export specific tables to GCS → S3
|
||||
# =============================================================================
|
||||
if [[ "${1:-}" == "--vm-export" ]]; then
|
||||
VM_NAME="${GCP_VM_NAME:-bd-export-vm}"
|
||||
VM_ZONE="${GCP_VM_ZONE:-us-central1-a}"
|
||||
VM_PROJECT="${GCP_VM_PROJECT:-raspa-491716}"
|
||||
TABLE_LIST="${2:-missing_tables.txt}"
|
||||
|
||||
log "=============================="
|
||||
log " VM EXPORT MODE"
|
||||
log " VM: $VM_NAME ($VM_ZONE)"
|
||||
log " Tables: $TABLE_LIST"
|
||||
log "=============================="
|
||||
|
||||
if [[ ! -f "$TABLE_LIST" ]]; then
|
||||
log_err "Table list not found: $TABLE_LIST"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "[1/5] Syncing files to VM..."
|
||||
gcloud compute scp \
|
||||
"$(dirname "$0")/roda.sh" \
|
||||
"$(dirname "$0")/.env" \
|
||||
"$(realpath "$TABLE_LIST")" \
|
||||
"$VM_NAME:~/" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT"
|
||||
|
||||
log "[2/5] Ensuring GCS bucket exists..."
|
||||
if ! gsutil ls "gs://$BUCKET_NAME" &>/dev/null; then
|
||||
gsutil mb -p "$VM_PROJECT" -l "$BUCKET_REGION" -b on "gs://$BUCKET_NAME"
|
||||
log " Bucket created: gs://$BUCKET_NAME"
|
||||
else
|
||||
log " Bucket already exists."
|
||||
fi
|
||||
|
||||
log "[3/5] Running export on VM (bq extract + rclone)..."
|
||||
gcloud compute ssh "$VM_NAME" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT" \
|
||||
--command="bash -s" <<'REMOTE_EXPORT'
|
||||
set -euo pipefail
|
||||
export DEBIAN_FRONTEND=noninteractive
|
||||
cd ~
|
||||
set -a
|
||||
# shellcheck source=.env
|
||||
source .env
|
||||
set +a
|
||||
source ~/.bashrc 2>/dev/null || true
|
||||
|
||||
export RCLONE_CONFIG_BD_TYPE="google cloud storage"
|
||||
export RCLONE_CONFIG_BD_BUCKET_POLICY_ONLY="true"
|
||||
export RCLONE_CONFIG_HZ_TYPE="s3"
|
||||
export RCLONE_CONFIG_HZ_PROVIDER="Other"
|
||||
export RCLONE_CONFIG_HZ_ENDPOINT="$HETZNER_S3_ENDPOINT"
|
||||
export RCLONE_CONFIG_HZ_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID"
|
||||
export RCLONE_CONFIG_HZ_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"
|
||||
|
||||
echo "[BQ EXTRACT] Starting export of missing tables..."
|
||||
|
||||
extract_table() {
|
||||
local table="$1"
|
||||
local dataset table_id gcs_prefix
|
||||
dataset=$(echo "$table" | cut -d. -f1)
|
||||
table_id=$(echo "$table" | cut -d. -f2)
|
||||
gcs_prefix="gs://$BUCKET_NAME/$dataset/$table_id"
|
||||
|
||||
echo "[EXTRACT] $table"
|
||||
bq extract \
|
||||
--project_id="$YOUR_PROJECT" \
|
||||
--destination_format=PARQUET \
|
||||
--compression=ZSTD \
|
||||
--location=US \
|
||||
"${SOURCE_PROJECT}:${dataset}.${table_id}" \
|
||||
"${gcs_prefix}/*.parquet" 2>&1 \
|
||||
|| echo "[FAIL] $table"
|
||||
}
|
||||
|
||||
export -f extract_table
|
||||
export BUCKET_NAME SOURCE_PROJECT
|
||||
|
||||
cat missing_tables.txt | parallel -j8 --bar extract_table {}
|
||||
|
||||
echo "[TRANSFER] GCS → Hetzner S3..."
|
||||
datasets=$(gsutil ls "gs://$BUCKET_NAME/" 2>/dev/null | sed 's|gs://[^/]*/||;s|/$||' | grep -v '^$' | sort -u)
|
||||
for ds in $datasets; do
|
||||
echo "[TRANSFER] $ds"
|
||||
rclone copy "bd:$BUCKET_NAME/$ds/" "hz:$HETZNER_S3_BUCKET/$ds/" \
|
||||
--transfers 32 --s3-upload-concurrency 32 --progress 2>&1 \
|
||||
|| echo "[FAIL_TRANSFER] $ds"
|
||||
done
|
||||
|
||||
echo "[DONE] Export complete."
|
||||
REMOTE_EXPORT
|
||||
|
||||
log "[4/5] Verifying transfer..."
|
||||
S3_COUNT=$(gcloud compute ssh "$VM_NAME" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT" \
|
||||
--command="source .env && rclone ls hz:\$HETZNER_S3_BUCKET 2>/dev/null | grep -c '\.parquet\$' || echo 0" 2>/dev/null)
|
||||
log " S3 parquet files: $S3_COUNT"
|
||||
|
||||
log "[5/5] Cleaning up GCS bucket..."
|
||||
read -rp "Delete GCS bucket gs://$BUCKET_NAME? [y/N] " confirm
|
||||
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
||||
gsutil -m rm -r "gs://$BUCKET_NAME"
|
||||
gsutil rb "gs://$BUCKET_NAME"
|
||||
log " Bucket deleted."
|
||||
fi
|
||||
|
||||
log "VM export complete."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# SYNC — BigQuery → S3 direct (no GCS intermediary)
|
||||
# =============================================================================
|
||||
6
start.sh
6
start.sh
@@ -19,13 +19,13 @@ SQL
|
||||
chmod 600 /app/ssh_init.sql
|
||||
|
||||
echo "[start] Starting ttyd terminal (db)..."
|
||||
ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/basedosdados.duckdb &
|
||||
ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/data/basedosdados.duckdb &
|
||||
|
||||
echo "[start] Starting ttyd terminal (ask)..."
|
||||
ttyd --port 7682 --writable python3 /app/ask.py &
|
||||
ttyd --port 7682 --writable /app/ask &
|
||||
|
||||
echo "[start] Starting auth service..."
|
||||
python3 /app/auth.py &
|
||||
python3 /app/shell/auth.py &
|
||||
|
||||
echo "[start] Starting Caddy..."
|
||||
exec caddy run --config /app/Caddyfile --adapter caddyfile
|
||||
|
||||
@@ -1,543 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
sync_bq_to_local.py
|
||||
|
||||
Syncs missing tables from BigQuery (basedosdados project) to Hetzner S3,
|
||||
then registers them as DuckDB views.
|
||||
|
||||
Usage:
|
||||
python3 sync_bq_to_local.py # full sync
|
||||
python3 sync_bq_to_local.py --dry-run # list missing tables only
|
||||
python3 sync_bq_to_local.py --resume # resume from last run
|
||||
|
||||
Prerequisites:
|
||||
gcloud auth application-default login
|
||||
GCP project with billing enabled (free tier: 1 TB/month)
|
||||
|
||||
Environment (.env):
|
||||
GCP_PROJECT - GCP project ID for billing
|
||||
HETZNER_S3_BUCKET - S3 bucket name
|
||||
HETZNER_S3_ENDPOINT - S3 endpoint URL
|
||||
AWS_ACCESS_KEY_ID - S3 access key
|
||||
AWS_SECRET_ACCESS_KEY - S3 secret key
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import subprocess
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
import boto3
|
||||
from botocore.config import Config as BotoConfig
|
||||
from google.cloud import bigquery
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Logging
|
||||
# ---------------------------------------------------------------------------
|
||||
LOG_FILE = f"sync_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[
|
||||
logging.FileHandler(LOG_FILE),
|
||||
logging.StreamHandler(sys.stdout),
|
||||
],
|
||||
)
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
SOURCE_PROJECT = "basedosdados"
|
||||
MISSING_TABLES_FILE = "tasks/datasets_to_scrap.md"
|
||||
DONE_FILE = "done_sync.txt"
|
||||
FAILED_FILE = "failed_sync.txt"
|
||||
DATA_DIR = "data"
|
||||
PARQUET_DIR = "parquet"
|
||||
MAX_RETRIES = 3
|
||||
BATCH_SIZE = 1 # export one table at a time to manage memory
|
||||
WORKERS = 4 # parallel uploads
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def load_env():
|
||||
"""Load required environment variables."""
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv()
|
||||
|
||||
required = [
|
||||
"GCP_PROJECT",
|
||||
"HETZNER_S3_BUCKET",
|
||||
"HETZNER_S3_ENDPOINT",
|
||||
"AWS_ACCESS_KEY_ID",
|
||||
"AWS_SECRET_ACCESS_KEY",
|
||||
]
|
||||
missing = [v for v in required if not os.environ.get(v)]
|
||||
if missing:
|
||||
log.error("Missing env vars: %s", missing)
|
||||
sys.exit(1)
|
||||
|
||||
return {v: os.environ[v] for v in required}
|
||||
|
||||
|
||||
def get_s3_client(env):
|
||||
"""Create boto3 S3 client configured for Hetzner."""
|
||||
return boto3.client(
|
||||
"s3",
|
||||
endpoint_url=env["HETZNER_S3_ENDPOINT"],
|
||||
aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
|
||||
aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
|
||||
config=BotoConfig(s3={"addressing_style": "path"}),
|
||||
)
|
||||
|
||||
|
||||
def get_bq_client():
|
||||
"""Create BigQuery client using Application Default Credentials."""
|
||||
try:
|
||||
os.environ["GOOGLE_CLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
|
||||
os.environ["GCLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
|
||||
client = bigquery.Client(project=os.environ.get("GCP_PROJECT", ""))
|
||||
# Test the connection
|
||||
list(client.list_datasets(max_results=1))
|
||||
return client
|
||||
except Exception as e:
|
||||
log.error("BigQuery auth failed: %s", e)
|
||||
log.error("")
|
||||
log.error("Run these commands to authenticate:")
|
||||
log.error(" gcloud auth login")
|
||||
log.error(" gcloud auth application-default login")
|
||||
log.error(" gcloud config set project %s", os.environ.get("GCP_PROJECT", ""))
|
||||
log.error("")
|
||||
log.error("The free tier (1 TB/month) is sufficient — no credit card needed.")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def list_bq_tables(bq_client):
|
||||
"""List all tables in the basedosdados BigQuery project."""
|
||||
log.info("Discovering tables in BigQuery project: %s", SOURCE_PROJECT)
|
||||
tables = {}
|
||||
|
||||
try:
|
||||
datasets = list(bq_client.list_datasets())
|
||||
log.info("Found %d datasets", len(datasets))
|
||||
except Exception as e:
|
||||
log.error("Failed to list datasets: %s", e)
|
||||
sys.exit(1)
|
||||
|
||||
for dataset in datasets:
|
||||
try:
|
||||
tables_list = list(
|
||||
bq_client.list_tables(
|
||||
f"{SOURCE_PROJECT}.{dataset.dataset_id}",
|
||||
max_results=10000,
|
||||
)
|
||||
)
|
||||
for t in tables_list:
|
||||
tables[f"{dataset.dataset_id}.{t.table_id}"] = {
|
||||
"dataset": dataset.dataset_id,
|
||||
"table": t.table_id,
|
||||
"full_id": f"{SOURCE_PROJECT}.{dataset.dataset_id}.{t.table_id}",
|
||||
"schema": [f.name for f in t.schema] if t.schema else [],
|
||||
"num_bytes": t.num_bytes,
|
||||
"num_rows": t.num_rows,
|
||||
}
|
||||
except Exception as e:
|
||||
log.warning("Failed to list tables in dataset %s: %s", dataset.dataset_id, e)
|
||||
|
||||
log.info("Total BigQuery tables discovered: %d", len(tables))
|
||||
return tables
|
||||
|
||||
|
||||
def list_s3_tables(s3_client, bucket):
|
||||
"""List datasets/tables already exported to S3."""
|
||||
log.info("Discovering tables already in S3 bucket: %s", bucket)
|
||||
table_files = defaultdict(lambda: defaultdict(list))
|
||||
|
||||
try:
|
||||
paginator = s3_client.get_paginator("list_objects_v2")
|
||||
for page in paginator.paginate(Bucket=bucket):
|
||||
for obj in page.get("Contents", []):
|
||||
key = obj["Key"]
|
||||
if not key.endswith(".parquet"):
|
||||
continue
|
||||
parts = key.split("/")
|
||||
if len(parts) >= 3:
|
||||
dataset, table = parts[0], parts[1]
|
||||
table_files[dataset][table].append(key)
|
||||
except Exception as e:
|
||||
log.warning("S3 listing error (may be empty bucket): %s", e)
|
||||
|
||||
tables = {}
|
||||
for dataset, t_dict in table_files.items():
|
||||
for table, files in t_dict.items():
|
||||
tables[f"{dataset}.{table}"] = files
|
||||
|
||||
log.info("Total S3 tables discovered: %d", len(tables))
|
||||
return tables
|
||||
|
||||
|
||||
def parse_missing_tables_from_md(filepath):
|
||||
"""Parse the missing tables from tasks/datasets_to_scrap.md.
|
||||
|
||||
Returns a dict mapping 'dataset.table' -> description.
|
||||
Falls back to None (use all non-S3 tables) if file not found.
|
||||
"""
|
||||
if not os.path.exists(filepath):
|
||||
log.warning("Missing file %s, using all non-S3 tables", filepath)
|
||||
return None
|
||||
|
||||
log.info("Parsing missing tables from %s", filepath)
|
||||
with open(filepath) as f:
|
||||
content = f.read()
|
||||
|
||||
missing = {}
|
||||
lines = content.split("\n")
|
||||
i = 0
|
||||
|
||||
def next_nonempty(lines, i):
|
||||
while i < len(lines) and not lines[i].strip():
|
||||
i += 1
|
||||
return i
|
||||
|
||||
while i < len(lines):
|
||||
line = lines[i].strip()
|
||||
|
||||
# Find the Basedosdados.org section
|
||||
if "Basedosdados.org" in line and "Not in basedosdados.duckdb" in line:
|
||||
log.info("Found Basedosdados.org section at line %d", i + 1)
|
||||
i += 1
|
||||
break
|
||||
i += 1
|
||||
|
||||
# Now parse table entries
|
||||
while i < len(lines):
|
||||
line = lines[i].strip()
|
||||
|
||||
# End of section only on top-level ## headers, not ### subsections
|
||||
if line.startswith("## "):
|
||||
break
|
||||
|
||||
# Skip separators and empty lines
|
||||
if not line or line.startswith("---") or "|---" in line:
|
||||
i += 1
|
||||
continue
|
||||
|
||||
# Find rows with backtick-wrapped dataset names (e.g. | `br_abrinq_oca` | ...)
|
||||
if "`" in line and "|" in line:
|
||||
# Split by pipe, strip whitespace and backticks
|
||||
parts = [p.strip().strip("`").strip() for p in line.split("|")]
|
||||
# Filter empty parts
|
||||
parts = [p for p in parts if p]
|
||||
|
||||
if len(parts) >= 2:
|
||||
dataset_raw = parts[0]
|
||||
# Check if it looks like a dataset name (br_*, eu_*, mundo_*, etc.)
|
||||
is_dataset = any(
|
||||
dataset_raw.startswith(prefix)
|
||||
for prefix in ("br_", "eu_", "mundo_", "nl_", "world_")
|
||||
)
|
||||
|
||||
if is_dataset:
|
||||
# parts[1] contains the missing table names (comma-separated)
|
||||
tables_raw = parts[1]
|
||||
for tbl in tables_raw.split(","):
|
||||
tbl = tbl.strip()
|
||||
# Clean up: remove parenthetical notes, trailing text
|
||||
if "(" in tbl:
|
||||
tbl = tbl.split("(")[0].strip()
|
||||
if tbl and not tbl.startswith("-"):
|
||||
missing[f"{dataset_raw}.{tbl}"] = f"from {filepath}"
|
||||
|
||||
i += 1
|
||||
|
||||
log.info("Parsed %d missing table references from MD", len(missing))
|
||||
return missing if missing else None
|
||||
|
||||
|
||||
def compute_missing_tables(bq_tables, s3_tables, md_missing):
|
||||
"""Compute which tables need to be synced."""
|
||||
if md_missing is None:
|
||||
log.info("No MD file, computing diff: BQ - S3")
|
||||
return [
|
||||
(table_id, info)
|
||||
for table_id, info in bq_tables.items()
|
||||
if table_id not in s3_tables
|
||||
]
|
||||
|
||||
log.info("Computing sync targets: MD missing tables not in S3")
|
||||
targets = []
|
||||
for key, info in bq_tables.items():
|
||||
if key in s3_tables:
|
||||
continue
|
||||
if key in md_missing:
|
||||
targets.append((key, info))
|
||||
else:
|
||||
# Table not in S3 but not in MD missing list
|
||||
# Check if its dataset is partially covered
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
# If any table from this dataset is in MD missing, include it
|
||||
dataset_in_md = any(
|
||||
k.startswith(f"{dataset}.") and k.split(".", 1)[1] in md_missing
|
||||
for k in bq_tables
|
||||
)
|
||||
if not dataset_in_md:
|
||||
targets.append((key, info))
|
||||
|
||||
return targets
|
||||
|
||||
|
||||
def estimate_size_mb(num_bytes):
|
||||
"""Estimate size in MB."""
|
||||
if num_bytes is None:
|
||||
return "?"
|
||||
return f"{num_bytes / 1_048_576:.1f}"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Export logic
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def sync_table(args, table_id, info, dry_run=False):
|
||||
"""Sync a single table: BQ → parquet → S3 → DuckDB view."""
|
||||
bq_client, s3_client, bucket = args
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
full_id = info["full_id"]
|
||||
|
||||
s3_key_prefix = f"{dataset}/{table}"
|
||||
|
||||
if dry_run:
|
||||
size_mb = estimate_size_mb(info.get("num_bytes"))
|
||||
return True, f"[DRY] {dataset}.{table} (~{size_mb} MB)"
|
||||
|
||||
# Step 1: Query from BigQuery
|
||||
log.info("Querying %s from BigQuery", full_id)
|
||||
query = f"SELECT * FROM `{full_id}`"
|
||||
|
||||
try:
|
||||
query_job = bq_client.query(query, location="US")
|
||||
df = query_job.to_dataframe()
|
||||
except Exception as e:
|
||||
return False, f"BQ query failed for {table_id}: {e}"
|
||||
|
||||
if df.empty:
|
||||
return True, f"[SKIP] {table_id} — empty table"
|
||||
|
||||
if df.shape[0] > 10_000_000:
|
||||
log.warning("Table %s has %d rows — may be slow/memory-intensive", table_id, df.shape[0])
|
||||
|
||||
# Step 2: Write to parquet in memory, then upload
|
||||
import io
|
||||
import pyarrow as pa
|
||||
import pyarrow.parquet as pq
|
||||
|
||||
buffer = io.BytesIO()
|
||||
table_pa = pa.Table.from_pandas(df)
|
||||
|
||||
# Write with zstd compression
|
||||
writer = pq.ParquetWriter(
|
||||
buffer,
|
||||
table_pa.schema,
|
||||
compression="zstd",
|
||||
use_dictionary=True,
|
||||
)
|
||||
writer.write_table(table_pa)
|
||||
writer.close()
|
||||
buffer.seek(0)
|
||||
|
||||
s3_key = f"{s3_key_prefix}/{table}.parquet"
|
||||
log.info("Uploading %s → s3://%s/%s (%s, %d rows)",
|
||||
table_id, bucket, s3_key,
|
||||
f"{buffer.getbuffer().nbytes / 1_048_576:.1f} MB",
|
||||
df.shape[0])
|
||||
|
||||
try:
|
||||
s3_client.upload_fileobj(
|
||||
buffer,
|
||||
bucket,
|
||||
s3_key,
|
||||
ExtraArgs={"ContentType": "application/octet-stream"},
|
||||
)
|
||||
except Exception as e:
|
||||
return False, f"S3 upload failed for {table_id}: {e}"
|
||||
|
||||
log.info("[DONE] %s uploaded to s3://%s/%s", table_id, bucket, s3_key)
|
||||
return True, f"[DONE] {table_id}"
|
||||
|
||||
|
||||
def update_duckdb_view(env, table_id, info):
|
||||
"""Register a new table as a DuckDB view over S3 parquet."""
|
||||
import duckdb
|
||||
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
bucket = env["HETZNER_S3_BUCKET"]
|
||||
endpoint = env["HETZNER_S3_ENDPOINT"].removeprefix("https://").removeprefix("http://")
|
||||
access_key = env["AWS_ACCESS_KEY_ID"]
|
||||
secret_key = env["AWS_SECRET_ACCESS_KEY"]
|
||||
|
||||
# S3 path
|
||||
s3_path = f"s3://{bucket}/{dataset}/{table}/{table}.parquet"
|
||||
|
||||
try:
|
||||
con = duckdb.connect("basedosdados.duckdb", read_only=False)
|
||||
con.execute("INSTALL httpfs; LOAD httpfs;")
|
||||
con.execute(f"SET s3_endpoint='{endpoint}';")
|
||||
con.execute(f"SET s3_access_key_id='{access_key}';")
|
||||
con.execute(f"SET s3_secret_access_key='{secret_key}';")
|
||||
con.execute(f"SET s3_url_style='path';")
|
||||
con.execute(f"CREATE SCHEMA IF NOT EXISTS {dataset}")
|
||||
con.execute(f"""
|
||||
CREATE OR REPLACE VIEW {dataset}.{table} AS
|
||||
SELECT * FROM read_parquet('{s3_path}', hive_partitioning=true, union_by_name=true)
|
||||
""")
|
||||
con.close()
|
||||
log.info("[DUCKDB] View created: %s.%s", dataset, table)
|
||||
return True, None
|
||||
except Exception as e:
|
||||
log.error("[DUCKDB] Failed to create view %s.%s: %s", dataset, table, e)
|
||||
return False, str(e)
|
||||
|
||||
|
||||
def run_sync(targets, args, env, dry_run=False, resume=False):
|
||||
"""Run the sync for all target tables."""
|
||||
s3_client = get_s3_client(env)
|
||||
bq_client = get_bq_client()
|
||||
|
||||
# Load done/failed tracking
|
||||
done_set = set()
|
||||
if resume:
|
||||
if os.path.exists(DONE_FILE):
|
||||
with open(DONE_FILE) as f:
|
||||
done_set = {l.strip() for l in f if l.strip()}
|
||||
log.info("Resuming: %d tables already done", len(done_set))
|
||||
|
||||
failed_count = 0
|
||||
done_count = 0
|
||||
|
||||
# Filter out already-done tables
|
||||
targets = [(tid, info) for tid, info in targets if tid not in done_set]
|
||||
|
||||
if not targets:
|
||||
log.info("No tables to sync.")
|
||||
return 0, 0
|
||||
|
||||
log.info("Syncing %d tables...", len(targets))
|
||||
|
||||
for i, (table_id, info) in enumerate(targets, 1):
|
||||
log.info("--- [%d/%d] Syncing %s ---", i, len(targets), table_id)
|
||||
|
||||
# Sync BQ → S3
|
||||
ok, msg = sync_table(
|
||||
(bq_client, s3_client, env["HETZNER_S3_BUCKET"]),
|
||||
table_id,
|
||||
info,
|
||||
dry_run=dry_run,
|
||||
)
|
||||
log.info(msg)
|
||||
|
||||
if dry_run:
|
||||
continue
|
||||
|
||||
if not ok:
|
||||
with open(FAILED_FILE, "a") as f:
|
||||
f.write(f"{table_id}\t{msg}\n")
|
||||
failed_count += 1
|
||||
continue
|
||||
|
||||
if "empty" in msg.lower():
|
||||
continue
|
||||
|
||||
# Update DuckDB view
|
||||
ok, err = update_duckdb_view(env, table_id, info)
|
||||
if not ok:
|
||||
with open(FAILED_FILE, "a") as f:
|
||||
f.write(f"{table_id}\tDUCKDB: {err}\n")
|
||||
|
||||
# Mark done
|
||||
with open(DONE_FILE, "a") as f:
|
||||
f.write(f"{table_id}\n")
|
||||
done_count += 1
|
||||
|
||||
return done_count, failed_count
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Sync missing BQ tables to S3")
|
||||
parser.add_argument("--dry-run", action="store_true", help="List tables without syncing")
|
||||
parser.add_argument("--resume", action="store_true", help="Resume from last run")
|
||||
args = parser.parse_args()
|
||||
|
||||
env = load_env()
|
||||
dry_run = args.dry_run
|
||||
|
||||
if dry_run:
|
||||
log.info("=== DRY RUN MODE ===")
|
||||
|
||||
# Step 1: List BigQuery tables
|
||||
bq_client = get_bq_client()
|
||||
bq_tables = list_bq_tables(bq_client)
|
||||
|
||||
# Step 2: List S3 tables
|
||||
s3_client = get_s3_client(env)
|
||||
s3_tables = list_s3_tables(s3_client, env["HETZNER_S3_BUCKET"])
|
||||
|
||||
# Step 3: Parse missing tables from MD
|
||||
md_missing = parse_missing_tables_from_md(MISSING_TABLES_FILE)
|
||||
|
||||
# Step 4: Compute targets
|
||||
targets = compute_missing_tables(bq_tables, s3_tables, md_missing)
|
||||
|
||||
if not targets:
|
||||
log.info("No tables to sync.")
|
||||
return
|
||||
|
||||
log.info("")
|
||||
log.info("============================================")
|
||||
log.info(" Tables to sync: %d", len(targets))
|
||||
log.info("============================================")
|
||||
for i, (table_id, info) in enumerate(targets, 1):
|
||||
size_mb = estimate_size_mb(info.get("num_bytes"))
|
||||
md_note = md_missing.get(table_id, "")
|
||||
log.info(" [%d] %-50s %6s MB %s", i, table_id, size_mb, md_note)
|
||||
log.info("")
|
||||
|
||||
if dry_run:
|
||||
total_bytes = sum(info.get("num_bytes", 0) or 0 for _, info in targets)
|
||||
total_gb = total_bytes / 1_073_741_824
|
||||
log.info("Total estimated size: %.2f GB (BigQuery compressed bytes)", total_gb)
|
||||
log.info("Run without --dry-run to start syncing.")
|
||||
return
|
||||
|
||||
# Step 5: Run sync
|
||||
log.info("Starting sync...")
|
||||
done_count, failed_count = run_sync(targets, None, env, dry_run=False, resume=args.resume)
|
||||
|
||||
log.info("")
|
||||
log.info("============================================")
|
||||
log.info(" Sync complete!")
|
||||
log.info(" Done: %d tables", done_count)
|
||||
log.info(" Failed: %d tables", failed_count)
|
||||
log.info(" Log: %s", LOG_FILE)
|
||||
log.info("============================================")
|
||||
|
||||
if failed_count > 0:
|
||||
log.info("Failed tables: see %s", FAILED_FILE)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -143,11 +143,36 @@ Sources from https://github.com/jxnxts/mcp-brasil not in `basedosdados.duckdb`.
|
||||
| INPE | `inpe` | none | `https://terrabrasilis.dpi.inpe.br/queimadas/bdqueimadas-data-service` | JSON |
|
||||
| Tabua Mares | `tabua_mares` | none | `https://tabuademares.com/api/v2` | JSON |
|
||||
|
||||
## Basedosdados.org — Not in basedosdados.duckdb (232 tables)
|
||||
## Basedosdados.org — Not in basedosdados.duckdb
|
||||
|
||||
Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and thus in your duckdb). The following datasets have **zero or partial** tables in duckdb.
|
||||
Basedosdados.org has **765 tables** on BigQuery, **~533** on S3. The remaining gap:
|
||||
|
||||
### Full datasets — no tables in duckdb
|
||||
- **2 TABLEs** need `bq extract` → GCS → S3 (waiting on GCP billing restore)
|
||||
- **~230 are VIEWs** → need `bq query` to materialize, then `bq extract` (or streaming write to S3)
|
||||
- **3 tables MISSING** from BQ entirely (br_bcb_sicor microdados_* don't exist)
|
||||
|
||||
### Need export — 2 TABLEs blocked on GCP billing
|
||||
|
||||
| Dataset | Table | BQ Type | Notes |
|
||||
|---------|-------|---------|-------|
|
||||
| `br_bcb_taxa_cambio` | taxa_cambio | TABLE | ✅ `bq extract` works |
|
||||
| `br_bcb_taxa_selic` | taxa_selic | TABLE | ✅ `bq extract` works |
|
||||
|
||||
### Already on S3 (no action needed)
|
||||
|
||||
| Dataset | Tables |
|
||||
|---------|--------|
|
||||
| `br_bd_metadados` | bigquery_tables, prefect_flow_runs |
|
||||
| `br_fbsp_absp` | uf, violencia_escola |
|
||||
| `br_ibge_estadic` | dicionario |
|
||||
| `br_camara_dados_abertos` | all 33 tables (222 parquet files) |
|
||||
| `br_me_rais` | dicionario, microdados_estabelecimentos, microdados_vinculos |
|
||||
|
||||
### ~230 VIEWs — need bq query materialization pipeline
|
||||
|
||||
Cannot `bq extract` directly. Need to: (1) materialize via `bq query --destination_table`, or (2) stream via Python Arrow → S3 directly.
|
||||
|
||||
#### Full datasets (all VIEWs)
|
||||
|
||||
| Dataset | Tables missing | Notes |
|
||||
|---------|----------------|-------|
|
||||
@@ -157,7 +182,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_anvisa_medicamentos_industrializados` | microdados | |
|
||||
| `br_ba_feiradesantana_camara_leis` | microdados | |
|
||||
| `br_bd_diretorios_data_tempo` | tempo, data, ano, mes, dia, hora, bimestre, trimestre, semestre, minuto, segundo | Directory of time dimensions |
|
||||
| `br_bd_metadados` | external_links, information_requests, organizations, prefect_flows, resources, tables | BD metadata catalog |
|
||||
| `br_bd_metadados` | external_links, information_requests, organizations, resources, tables | |
|
||||
| `br_bd_vizinhanca` | municipio, uf | |
|
||||
| `br_caixa_sorteios` | megasena | |
|
||||
| `br_camara_dados_abertos` | sigla_partido | |
|
||||
@@ -179,7 +204,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_ieps_saude` | brasil, macrorregiao, municipio, regiao_saude, uf | |
|
||||
| `br_imprensa_nacional_dou` | secao_1, secao_2, secao_3 | Official gazette sections |
|
||||
| `br_ipea_acesso_oportunidades` | estatisticas_2019, indicadores_2019 | |
|
||||
| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_municipio_de_para_anual/decenal/quinquenal, transicao_uf_de_para_anual/decenal/quinquenal | |
|
||||
| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_*(anual/decenal/quinquenal) | |
|
||||
| `br_mc_indicadores` | transferencias_municipio | |
|
||||
| `br_me_clima_organizacional` | microdados | |
|
||||
| `br_me_estoque_divida_publica` | microdados | |
|
||||
@@ -188,7 +213,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_me_siape` | servidores_executivo_federal | |
|
||||
| `br_me_siorg` | remuneracao | |
|
||||
| `br_mma_extincao` | fauna_ameacada, flora_ameacada | |
|
||||
| `br_mobilidados_indicadores` | 11 tables (comprometimento_renda_tarifa_transp_publico, proporcao_*, taxa_motorizacao, etc.) | |
|
||||
| `br_mobilidados_indicadores` | 11 tables | |
|
||||
| `br_ms_atencao_basica` | municipio | |
|
||||
| `br_ms_imunizacoes` | municipio | |
|
||||
| `br_ons_energia_armazenada` | subsistemas | |
|
||||
@@ -219,18 +244,16 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `world_ti_corruption_perception` | country | |
|
||||
| `world_wb_wwbi` | country_finance, country_indicators | |
|
||||
|
||||
### Partial datasets — some tables in duckdb, some missing
|
||||
#### Partial datasets — missing tables (all VIEWs, except where noted)
|
||||
|
||||
| Dataset | Missing tables | In duckdb |
|
||||
|---------|----------------|-----------|
|
||||
| `br_anatel_banda_larga_fixa` | backhaul, pble | densidade_*, microdados |
|
||||
| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, recurso_publico_* |
|
||||
| `br_bcb_taxa_cambio` | taxa_cambio | — (ACCESS_DENIED) |
|
||||
| `br_bcb_taxa_selic` | taxa_selic | — (ACCESS_DENIED) |
|
||||
| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, + 5 more TABLEs |
|
||||
| `br_ibge_pib` | brasil_antigo, municipio_antigo, regiao_antigo, uf, uf_antigo | gini, municipio |
|
||||
| `br_ibge_pnad_covid` | microdados | dicionario |
|
||||
| `br_ibge_pnadc` | ano_brasil_grupo_idade, ano_brasil_raca_cor, ano_municipio_*, ano_regiao_*, ano_uf_* (cross-tabs) | dicionario, educacao, microdados, rendimentos_outras_fontes |
|
||||
| `br_ibge_pof` | all 17 tables (morador, domicilio, despesa_*, consumo_*, etc.) | none |
|
||||
| `br_ibge_pnadc` | 10 cross-tab tables (ano_*) | dicionario, educacao, microdados, rendimentos_outras_fontes |
|
||||
| `br_ibge_pof` | all 17 tables (morador_*, domicilio_*, despesa_*, consumo_*, etc.) | none |
|
||||
| `br_inep_ana` | aluno, escola, prova | dicionario |
|
||||
| `br_inep_censo_escolar` | docente, matricula | dicionario, escola, turma |
|
||||
| `br_inep_formacao_docente` | brasil, escola, municipio, regiao, uf | dicionario |
|
||||
@@ -238,8 +261,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_inep_indicadores_educacionais` | escola_nivel_socioeconomico, fluxo_educacao_superior | all others |
|
||||
| `br_inmet_bdmep` | estacao | microdados |
|
||||
| `br_me_caged` | microdados_antigos, microdados_antigos_ajustes | dicionario, microdados_movimentacao* |
|
||||
| `br_me_cno` | microdados, microdados_cnae, microdados_vinculo | dicionario, microdados |
|
||||
| `br_me_rais` | all tables | dicionario, microdados_estabelecimentos, microdados_vinculos |
|
||||
| `br_me_cno` | microdados_cnae, microdados_vinculo | dicionario, microdados |
|
||||
| `br_mec_prouni` | microdados | dicionario |
|
||||
| `br_ms_sim` | municipio, municipio_causa, municipio_causa_idade, municipio_causa_idade_sexo_raca | dicionario, microdados |
|
||||
| `br_ms_sinan` | microdados_violencia | dicionario, microdados_dengue, microdados_influenza_srag |
|
||||
@@ -247,3 +269,13 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_seeg_emissoes` | brasil | dicionario, municipio, uf |
|
||||
| `br_tse_eleicoes` | local_secao | all others |
|
||||
| `world_oecd_pisa` | dictionary, school_summary, student_summary | student |
|
||||
|
||||
### Tables that don't exist in BigQuery (3)
|
||||
|
||||
These were listed in datasets_to_scrap but actually don't exist in `basedosdados`:
|
||||
|
||||
| Dataset | Table |
|
||||
|---------|-------|
|
||||
| `br_bcb_sicor` | microdados_liberacao |
|
||||
| `br_bcb_sicor` | microdados_operacao |
|
||||
| `br_bcb_sicor` | microdados_saldo |
|
||||
|
||||
270
tasks/missing_tables.txt
Normal file
270
tasks/missing_tables.txt
Normal file
@@ -0,0 +1,270 @@
|
||||
br_abrinq_oca.municipio_primeira_infancia
|
||||
br_ana_atlas_esgotos.municipio
|
||||
br_ana_reservatorios.sin
|
||||
br_anvisa_medicamentos_industrializados.microdados
|
||||
br_ba_feiradesantana_camara_leis.microdados
|
||||
br_bd_diretorios_data_tempo.ano
|
||||
br_bd_diretorios_data_tempo.bimestre
|
||||
br_bd_diretorios_data_tempo.data
|
||||
br_bd_diretorios_data_tempo.dia
|
||||
br_bd_diretorios_data_tempo.hora
|
||||
br_bd_diretorios_data_tempo.mes
|
||||
br_bd_diretorios_data_tempo.minuto
|
||||
br_bd_diretorios_data_tempo.segundo
|
||||
br_bd_diretorios_data_tempo.semestre
|
||||
br_bd_diretorios_data_tempo.tempo
|
||||
br_bd_diretorios_data_tempo.trimestre
|
||||
br_bd_metadados.bigquery_tables
|
||||
br_bd_metadados.external_links
|
||||
br_bd_metadados.information_requests
|
||||
br_bd_metadados.organizations
|
||||
br_bd_metadados.prefect_flow_runs
|
||||
br_bd_metadados.resources
|
||||
br_bd_metadados.tables
|
||||
br_bd_vizinhanca.municipio
|
||||
br_bd_vizinhanca.uf
|
||||
br_caixa_sorteios.megasena
|
||||
br_camara_dados_abertos.deputado
|
||||
br_camara_dados_abertos.deputado_ocupacao
|
||||
br_camara_dados_abertos.deputado_profissao
|
||||
br_camara_dados_abertos.despesa
|
||||
br_camara_dados_abertos.evento
|
||||
br_camara_dados_abertos.evento_orgao
|
||||
br_camara_dados_abertos.evento_presenca_deputado
|
||||
br_camara_dados_abertos.evento_requerimento
|
||||
br_camara_dados_abertos.frente
|
||||
br_camara_dados_abertos.frente_deputado
|
||||
br_camara_dados_abertos.funcionario
|
||||
br_camara_dados_abertos.legislatura
|
||||
br_camara_dados_abertos.legislatura_mesa
|
||||
br_camara_dados_abertos.licitacao
|
||||
br_camara_dados_abertos.licitacao_contrato
|
||||
br_camara_dados_abertos.licitacao_item
|
||||
br_camara_dados_abertos.licitacao_pedido
|
||||
br_camara_dados_abertos.licitacao_proposta
|
||||
br_camara_dados_abertos.orgao
|
||||
br_camara_dados_abertos.orgao_deputado
|
||||
br_camara_dados_abertos.proposicao_autor
|
||||
br_camara_dados_abertos.proposicao_microdados
|
||||
br_camara_dados_abertos.proposicao_tema
|
||||
br_camara_dados_abertos.sigla_partido
|
||||
br_camara_dados_abertos.votacao
|
||||
br_camara_dados_abertos.votacao_objeto
|
||||
br_camara_dados_abertos.votacao_orientacao_bancada
|
||||
br_camara_dados_abertos.votacao_parlamentar
|
||||
br_camara_dados_abertos.votacao_proposicao
|
||||
br_capes_bolsas.mobilidade_internacional
|
||||
br_cgu_ebt.municipio
|
||||
br_cgu_ebt.uf
|
||||
br_cgu_fef.microdados
|
||||
br_cgu_fef.municipios_sorteados
|
||||
br_cgu_fef.sorteio
|
||||
br_cgu_pessoal_executivo_federal.terceirizados
|
||||
br_clp_ranking_competitividade.nota_geral_municipio
|
||||
br_clp_ranking_competitividade.nota_geral_uf
|
||||
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
|
||||
br_fbsp_absp.municipio
|
||||
br_fbsp_absp.uf
|
||||
br_fbsp_absp.violencia_escola
|
||||
br_firjan_ifgf.ranking
|
||||
br_ggb_relatorio_lgbtqi.brasil
|
||||
br_ggb_relatorio_lgbtqi.causa_obito
|
||||
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
|
||||
br_ggb_relatorio_lgbtqi.local
|
||||
br_ggb_relatorio_lgbtqi.raca_cor
|
||||
br_ibge_amc.municipio_de_para
|
||||
br_ibge_cbo_2002.perfil_ocupacional
|
||||
br_ibge_cbo_2002.sinonimo
|
||||
br_ibge_estadic.comunicacao_informatica
|
||||
br_ibge_estadic.dicionario
|
||||
br_ibge_estadic.educacao
|
||||
br_ibge_estadic.governanca
|
||||
br_ibge_estadic.indicadores_perfil_gestor
|
||||
br_ibge_estadic.indicadores_quantidade_vinculo
|
||||
br_ibge_estadic.politica_mulher
|
||||
br_ibge_estadic.recursos_humanos
|
||||
br_ibge_ipp.mes_categoria_economica
|
||||
br_ibge_ipp.mes_grupo_industrial
|
||||
br_ibge_ipp.mes_industria_atividade
|
||||
br_ibge_ipp.mes_industria_extrativa
|
||||
br_ibge_ipp.mes_industria_geral
|
||||
br_ibge_ipp.mes_industria_transformacao
|
||||
br_ibge_munic.indicadores_perfil_gestor
|
||||
br_ibge_munic.indicadores_quantidade_vinculo
|
||||
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
|
||||
br_ieps_saude.brasil
|
||||
br_ieps_saude.macrorregiao
|
||||
br_ieps_saude.municipio
|
||||
br_ieps_saude.regiao_saude
|
||||
br_ieps_saude.uf
|
||||
br_imprensa_nacional_dou.secao_1
|
||||
br_imprensa_nacional_dou.secao_2
|
||||
br_imprensa_nacional_dou.secao_3
|
||||
br_ipea_acesso_oportunidades.estatisticas_2019
|
||||
br_ipea_acesso_oportunidades.indicadores_2019
|
||||
br_mapbiomas_estatisticas.classe
|
||||
br_mapbiomas_estatisticas.cobertura_municipio_classe
|
||||
br_mapbiomas_estatisticas.cobertura_uf_classe
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
|
||||
br_mc_indicadores.transferencias_municipio
|
||||
br_me_clima_organizacional.microdados
|
||||
br_me_estoque_divida_publica.microdados
|
||||
br_me_exportadoras_importadoras.dicionario
|
||||
br_me_exportadoras_importadoras.estabelecimentos
|
||||
br_me_pensionistas.microdados
|
||||
br_me_siape.servidores_executivo_federal
|
||||
br_me_siorg.remuneracao
|
||||
br_mma_extincao.fauna_ameacada
|
||||
br_mma_extincao.flora_ameacada
|
||||
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
|
||||
br_mobilidados_indicadores.divisao_modal
|
||||
br_mobilidados_indicadores.emissao_co2_material_particulado
|
||||
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
|
||||
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
|
||||
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
|
||||
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
|
||||
br_mobilidados_indicadores.taxa_motorizacao
|
||||
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
|
||||
br_mobilidados_indicadores.transporte_media_alta_capacidade
|
||||
br_ms_atencao_basica.municipio
|
||||
br_ms_imunizacoes.municipio
|
||||
br_ons_energia_armazenada.subsistemas
|
||||
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
|
||||
br_rj_rio_de_janeiro_ipp_ips.indicadores
|
||||
br_rj_tce_iegm.indicadores
|
||||
br_senado_cpipandemia.discursos
|
||||
br_sgp_informacao.despesas_cartao_corporativo
|
||||
br_sp_alesp.assessores_lideranca
|
||||
br_sp_alesp.assessores_parlamentares
|
||||
br_sp_alesp.deputado
|
||||
br_sp_alesp.despesas_gabinete
|
||||
br_sp_alesp.despesas_gabinete_atual
|
||||
br_sp_gov_orcamento.despesa
|
||||
br_sp_gov_orcamento.receita_arrecadada
|
||||
br_sp_gov_orcamento.receita_prevista
|
||||
br_sp_gov_ssp.ocorrencias_registradas
|
||||
br_sp_gov_ssp.produtividade_policial
|
||||
br_sp_saopaulo_dieese_icv.ano
|
||||
br_sp_seduc_fluxo_escolar.escola
|
||||
br_sp_seduc_fluxo_escolar.municipio
|
||||
br_sp_seduc_idesp.diretoria
|
||||
br_sp_seduc_idesp.escola
|
||||
br_sp_seduc_idesp.uf
|
||||
br_sp_seduc_inse.escola
|
||||
br_tpe_classificacao_saeb.categoria
|
||||
eu_fra_lgbt.consciencia_direitos
|
||||
eu_fra_lgbt.cotidiano
|
||||
eu_fra_lgbt.discriminacao
|
||||
eu_fra_lgbt.especifico_transgenero
|
||||
eu_fra_lgbt.violencia_abuso
|
||||
mundo_bm_learning_poverty.pais
|
||||
mundo_kaggle_olimpiadas.microdados
|
||||
mundo_onu_adh.brasil
|
||||
mundo_onu_adh.municipio
|
||||
mundo_onu_adh.uf
|
||||
mundo_transrespect_transphobia.causa_obito
|
||||
mundo_transrespect_transphobia.local
|
||||
mundo_transrespect_transphobia.pais
|
||||
nl_ug_pwt.microdados
|
||||
world_fao_production.country_group
|
||||
world_fao_production.crop_livestock
|
||||
world_fao_production.dictionary
|
||||
world_fao_production.element
|
||||
world_fao_production.item
|
||||
world_fao_production.item_group
|
||||
world_fao_production.production_indices
|
||||
world_fao_production.value_agricultural_production
|
||||
world_fifa_women_world_cup.matches
|
||||
world_fifa_worldcup.award_winners
|
||||
world_fifa_worldcup.matches
|
||||
world_fifa_worldcup.players
|
||||
world_fifa_worldcup.teams
|
||||
world_fifa_worldcup.tournaments
|
||||
world_gsps_consortium_gsps.global_indicators
|
||||
world_slave_voyages_consortium_slave_trade.transatlantic
|
||||
world_spi_spi.global_indicators
|
||||
world_ti_corruption_perception.country
|
||||
world_wb_wwbi.country_finance
|
||||
world_wb_wwbi.country_indicators
|
||||
br_anatel_banda_larga_fixa.backhaul
|
||||
br_anatel_banda_larga_fixa.pble
|
||||
br_bcb_sicor.microdados_liberacao
|
||||
br_bcb_sicor.microdados_operacao
|
||||
br_bcb_sicor.microdados_saldo
|
||||
br_bcb_taxa_cambio.taxa_cambio
|
||||
br_bcb_taxa_selic.taxa_selic
|
||||
br_ibge_pib.brasil_antigo
|
||||
br_ibge_pib.municipio_antigo
|
||||
br_ibge_pib.regiao_antigo
|
||||
br_ibge_pib.uf
|
||||
br_ibge_pib.uf_antigo
|
||||
br_ibge_pnad_covid.microdados
|
||||
br_ibge_pnadc.ano_brasil_grupo_idade
|
||||
br_ibge_pnadc.ano_brasil_raca_cor
|
||||
br_ibge_pnadc.ano_municipio_grupo_idade
|
||||
br_ibge_pnadc.ano_municipio_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_raca_cor
|
||||
br_ibge_pnadc.ano_uf_grupo_idade
|
||||
br_ibge_pnadc.ano_uf_raca_cor
|
||||
br_ibge_pof.aluguel_estimado_2017
|
||||
br_ibge_pof.cadastro_de_produtos_2017
|
||||
br_ibge_pof.caderneta_coletiva_2017
|
||||
br_ibge_pof.caracteristicas_dieta_2017
|
||||
br_ibge_pof.condicoes_vida_2017
|
||||
br_ibge_pof.consumo_alimentar_2017
|
||||
br_ibge_pof.despesa_coletiva_2017
|
||||
br_ibge_pof.despesa_individual_2017
|
||||
br_ibge_pof.domicilio_2017
|
||||
br_ibge_pof.inventario_2017
|
||||
br_ibge_pof.morador_2017
|
||||
br_ibge_pof.outros_rendimentos_2017
|
||||
br_ibge_pof.rendimento_trabalho_2017
|
||||
br_ibge_pof.restricao_saude_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof2_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof4_2017
|
||||
br_inep_ana.aluno
|
||||
br_inep_ana.escola
|
||||
br_inep_ana.prova
|
||||
br_inep_censo_escolar.docente
|
||||
br_inep_censo_escolar.matricula
|
||||
br_inep_formacao_docente.brasil
|
||||
br_inep_formacao_docente.escola
|
||||
br_inep_formacao_docente.municipio
|
||||
br_inep_formacao_docente.regiao
|
||||
br_inep_formacao_docente.uf
|
||||
br_inep_indicador_nivel_socioeconomico.brasil
|
||||
br_inep_indicador_nivel_socioeconomico.municipio
|
||||
br_inep_indicador_nivel_socioeconomico.uf
|
||||
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
|
||||
br_inep_indicadores_educacionais.fluxo_educacao_superior
|
||||
br_inmet_bdmep.estacao
|
||||
br_me_caged.microdados_antigos
|
||||
br_me_caged.microdados_antigos_ajustes
|
||||
br_me_cno.microdados_cnae
|
||||
br_me_cno.microdados_vinculo
|
||||
br_me_rais.dicionario
|
||||
br_me_rais.microdados_estabelecimentos
|
||||
br_me_rais.microdados_vinculos
|
||||
br_mec_prouni.microdados
|
||||
br_ms_sim.municipio
|
||||
br_ms_sim.municipio_causa
|
||||
br_ms_sim.municipio_causa_idade
|
||||
br_ms_sim.municipio_causa_idade_sexo_raca
|
||||
br_ms_sinan.microdados_violencia
|
||||
br_ms_vacinacao_covid19.microdados
|
||||
br_ms_vacinacao_covid19.microdados_estabelecimento
|
||||
br_ms_vacinacao_covid19.microdados_paciente
|
||||
br_ms_vacinacao_covid19.microdados_vacinacao
|
||||
br_seeg_emissoes.brasil
|
||||
br_tse_eleicoes.local_secao
|
||||
world_oecd_pisa.dictionary
|
||||
world_oecd_pisa.school_summary
|
||||
world_oecd_pisa.student_summary
|
||||
2
tasks/pending_tables.txt
Normal file
2
tasks/pending_tables.txt
Normal file
@@ -0,0 +1,2 @@
|
||||
br_bcb_taxa_cambio.taxa_cambio
|
||||
br_bcb_taxa_selic.taxa_selic
|
||||
229
tasks/views_to_materialize.txt
Normal file
229
tasks/views_to_materialize.txt
Normal file
@@ -0,0 +1,229 @@
|
||||
br_abrinq_oca.municipio_primeira_infancia
|
||||
br_ana_atlas_esgotos.municipio
|
||||
br_ana_reservatorios.sin
|
||||
br_anvisa_medicamentos_industrializados.microdados
|
||||
br_ba_feiradesantana_camara_leis.microdados
|
||||
br_bd_diretorios_data_tempo.ano
|
||||
br_bd_diretorios_data_tempo.bimestre
|
||||
br_bd_diretorios_data_tempo.data
|
||||
br_bd_diretorios_data_tempo.dia
|
||||
br_bd_diretorios_data_tempo.hora
|
||||
br_bd_diretorios_data_tempo.mes
|
||||
br_bd_diretorios_data_tempo.minuto
|
||||
br_bd_diretorios_data_tempo.segundo
|
||||
br_bd_diretorios_data_tempo.semestre
|
||||
br_bd_diretorios_data_tempo.tempo
|
||||
br_bd_diretorios_data_tempo.trimestre
|
||||
br_bd_metadados.external_links
|
||||
br_bd_metadados.information_requests
|
||||
br_bd_metadados.organizations
|
||||
br_bd_metadados.resources
|
||||
br_bd_metadados.tables
|
||||
br_bd_vizinhanca.municipio
|
||||
br_bd_vizinhanca.uf
|
||||
br_caixa_sorteios.megasena
|
||||
br_camara_dados_abertos.sigla_partido
|
||||
br_capes_bolsas.mobilidade_internacional
|
||||
br_cgu_ebt.municipio
|
||||
br_cgu_ebt.uf
|
||||
br_cgu_fef.microdados
|
||||
br_cgu_fef.municipios_sorteados
|
||||
br_cgu_fef.sorteio
|
||||
br_cgu_pessoal_executivo_federal.terceirizados
|
||||
br_clp_ranking_competitividade.nota_geral_municipio
|
||||
br_clp_ranking_competitividade.nota_geral_uf
|
||||
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
|
||||
br_fbsp_absp.municipio
|
||||
br_firjan_ifgf.ranking
|
||||
br_ggb_relatorio_lgbtqi.brasil
|
||||
br_ggb_relatorio_lgbtqi.causa_obito
|
||||
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
|
||||
br_ggb_relatorio_lgbtqi.local
|
||||
br_ggb_relatorio_lgbtqi.raca_cor
|
||||
br_ibge_amc.municipio_de_para
|
||||
br_ibge_cbo_2002.perfil_ocupacional
|
||||
br_ibge_cbo_2002.sinonimo
|
||||
br_ibge_estadic.comunicacao_informatica
|
||||
br_ibge_estadic.educacao
|
||||
br_ibge_estadic.governanca
|
||||
br_ibge_estadic.indicadores_perfil_gestor
|
||||
br_ibge_estadic.indicadores_quantidade_vinculo
|
||||
br_ibge_estadic.politica_mulher
|
||||
br_ibge_estadic.recursos_humanos
|
||||
br_ibge_ipp.mes_categoria_economica
|
||||
br_ibge_ipp.mes_grupo_industrial
|
||||
br_ibge_ipp.mes_industria_atividade
|
||||
br_ibge_ipp.mes_industria_extrativa
|
||||
br_ibge_ipp.mes_industria_geral
|
||||
br_ibge_ipp.mes_industria_transformacao
|
||||
br_ibge_munic.indicadores_perfil_gestor
|
||||
br_ibge_munic.indicadores_quantidade_vinculo
|
||||
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
|
||||
br_ibge_pib.brasil_antigo
|
||||
br_ibge_pib.municipio_antigo
|
||||
br_ibge_pib.regiao_antigo
|
||||
br_ibge_pib.uf
|
||||
br_ibge_pib.uf_antigo
|
||||
br_ibge_pnad_covid.microdados
|
||||
br_ibge_pnadc.ano_brasil_grupo_idade
|
||||
br_ibge_pnadc.ano_brasil_raca_cor
|
||||
br_ibge_pnadc.ano_municipio_grupo_idade
|
||||
br_ibge_pnadc.ano_municipio_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_raca_cor
|
||||
br_ibge_pnadc.ano_uf_grupo_idade
|
||||
br_ibge_pnadc.ano_uf_raca_cor
|
||||
br_ibge_pof.aluguel_estimado_2017
|
||||
br_ibge_pof.cadastro_de_produtos_2017
|
||||
br_ibge_pof.caderneta_coletiva_2017
|
||||
br_ibge_pof.caracteristicas_dieta_2017
|
||||
br_ibge_pof.condicoes_vida_2017
|
||||
br_ibge_pof.consumo_alimentar_2017
|
||||
br_ibge_pof.despesa_coletiva_2017
|
||||
br_ibge_pof.despesa_individual_2017
|
||||
br_ibge_pof.domicilio_2017
|
||||
br_ibge_pof.inventario_2017
|
||||
br_ibge_pof.morador_2017
|
||||
br_ibge_pof.outros_rendimentos_2017
|
||||
br_ibge_pof.rendimento_trabalho_2017
|
||||
br_ibge_pof.restricao_saude_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof2_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof4_2017
|
||||
br_ieps_saude.brasil
|
||||
br_ieps_saude.macrorregiao
|
||||
br_ieps_saude.municipio
|
||||
br_ieps_saude.regiao_saude
|
||||
br_ieps_saude.uf
|
||||
br_imprensa_nacional_dou.secao_1
|
||||
br_imprensa_nacional_dou.secao_2
|
||||
br_imprensa_nacional_dou.secao_3
|
||||
br_ipea_acesso_oportunidades.estatisticas_2019
|
||||
br_ipea_acesso_oportunidades.indicadores_2019
|
||||
br_mapbiomas_estatisticas.classe
|
||||
br_mapbiomas_estatisticas.cobertura_municipio_classe
|
||||
br_mapbiomas_estatisticas.cobertura_uf_classe
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
|
||||
br_mc_indicadores.transferencias_municipio
|
||||
br_me_caged.microdados_antigos
|
||||
br_me_caged.microdados_antigos_ajustes
|
||||
br_me_clima_organizacional.microdados
|
||||
br_me_cno.microdados_cnae
|
||||
br_me_cno.microdados_vinculo
|
||||
br_me_estoque_divida_publica.microdados
|
||||
br_me_exportadoras_importadoras.dicionario
|
||||
br_me_exportadoras_importadoras.estabelecimentos
|
||||
br_me_pensionistas.microdados
|
||||
br_me_siape.servidores_executivo_federal
|
||||
br_me_siorg.remuneracao
|
||||
br_mec_prouni.microdados
|
||||
br_mma_extincao.fauna_ameacada
|
||||
br_mma_extincao.flora_ameacada
|
||||
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
|
||||
br_mobilidados_indicadores.divisao_modal
|
||||
br_mobilidados_indicadores.emissao_co2_material_particulado
|
||||
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
|
||||
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
|
||||
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
|
||||
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
|
||||
br_mobilidados_indicadores.taxa_motorizacao
|
||||
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
|
||||
br_mobilidados_indicadores.transporte_media_alta_capacidade
|
||||
br_ms_atencao_basica.municipio
|
||||
br_ms_imunizacoes.municipio
|
||||
br_ms_sim.municipio
|
||||
br_ms_sim.municipio_causa
|
||||
br_ms_sim.municipio_causa_idade
|
||||
br_ms_sim.municipio_causa_idade_sexo_raca
|
||||
br_ms_sinan.microdados_violencia
|
||||
br_ms_vacinacao_covid19.microdados
|
||||
br_ms_vacinacao_covid19.microdados_estabelecimento
|
||||
br_ms_vacinacao_covid19.microdados_paciente
|
||||
br_ms_vacinacao_covid19.microdados_vacinacao
|
||||
br_ons_energia_armazenada.subsistemas
|
||||
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
|
||||
br_rj_rio_de_janeiro_ipp_ips.indicadores
|
||||
br_rj_tce_iegm.indicadores
|
||||
br_seeg_emissoes.brasil
|
||||
br_senado_cpipandemia.discursos
|
||||
br_sgp_informacao.despesas_cartao_corporativo
|
||||
br_sp_alesp.assessores_lideranca
|
||||
br_sp_alesp.assessores_parlamentares
|
||||
br_sp_alesp.deputado
|
||||
br_sp_alesp.despesas_gabinete
|
||||
br_sp_alesp.despesas_gabinete_atual
|
||||
br_sp_gov_orcamento.despesa
|
||||
br_sp_gov_orcamento.receita_arrecadada
|
||||
br_sp_gov_orcamento.receita_prevista
|
||||
br_sp_gov_ssp.ocorrencias_registradas
|
||||
br_sp_gov_ssp.produtividade_policial
|
||||
br_sp_saopaulo_dieese_icv.ano
|
||||
br_sp_seduc_fluxo_escolar.escola
|
||||
br_sp_seduc_fluxo_escolar.municipio
|
||||
br_sp_seduc_idesp.diretoria
|
||||
br_sp_seduc_idesp.escola
|
||||
br_sp_seduc_idesp.uf
|
||||
br_sp_seduc_inse.escola
|
||||
br_tpe_classificacao_saeb.categoria
|
||||
br_tse_eleicoes.local_secao
|
||||
eu_fra_lgbt.consciencia_direitos
|
||||
eu_fra_lgbt.cotidiano
|
||||
eu_fra_lgbt.discriminacao
|
||||
eu_fra_lgbt.especifico_transgenero
|
||||
eu_fra_lgbt.violencia_abuso
|
||||
mundo_bm_learning_poverty.pais
|
||||
mundo_kaggle_olimpiadas.microdados
|
||||
mundo_onu_adh.brasil
|
||||
mundo_onu_adh.municipio
|
||||
mundo_onu_adh.uf
|
||||
mundo_transrespect_transphobia.causa_obito
|
||||
mundo_transrespect_transphobia.local
|
||||
mundo_transrespect_transphobia.pais
|
||||
nl_ug_pwt.microdados
|
||||
world_fao_production.country_group
|
||||
world_fao_production.crop_livestock
|
||||
world_fao_production.dictionary
|
||||
world_fao_production.element
|
||||
world_fao_production.item
|
||||
world_fao_production.item_group
|
||||
world_fao_production.production_indices
|
||||
world_fao_production.value_agricultural_production
|
||||
world_fifa_women_world_cup.matches
|
||||
world_fifa_worldcup.award_winners
|
||||
world_fifa_worldcup.matches
|
||||
world_fifa_worldcup.players
|
||||
world_fifa_worldcup.teams
|
||||
world_fifa_worldcup.tournaments
|
||||
world_gsps_consortium_gsps.global_indicators
|
||||
world_oecd_pisa.dictionary
|
||||
world_oecd_pisa.school_summary
|
||||
world_oecd_pisa.student_summary
|
||||
world_slave_voyages_consortium_slave_trade.transatlantic
|
||||
world_spi_spi.global_indicators
|
||||
world_ti_corruption_perception.country
|
||||
world_wb_wwbi.country_finance
|
||||
world_wb_wwbi.country_indicators
|
||||
br_anatel_banda_larga_fixa.backhaul
|
||||
br_anatel_banda_larga_fixa.pble
|
||||
br_inep_ana.aluno
|
||||
br_inep_ana.escola
|
||||
br_inep_ana.prova
|
||||
br_inep_censo_escolar.docente
|
||||
br_inep_censo_escolar.matricula
|
||||
br_inep_formacao_docente.brasil
|
||||
br_inep_formacao_docente.escola
|
||||
br_inep_formacao_docente.municipio
|
||||
br_inep_formacao_docente.regiao
|
||||
br_inep_formacao_docente.uf
|
||||
br_inep_indicador_nivel_socioeconomico.brasil
|
||||
br_inep_indicador_nivel_socioeconomico.municipio
|
||||
br_inep_indicador_nivel_socioeconomico.uf
|
||||
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
|
||||
br_inep_indicadores_educacionais.fluxo_educacao_superior
|
||||
br_inmet_bdmep.estacao
|
||||
Reference in New Issue
Block a user