refactor: reorganize project structure and fix broken references

- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
commit ed5fa6756e (parent 02cb13362c)
2026-03-29 20:46:27 +02:00
43 changed files with 302366 additions and 1093 deletions

.gitignore (vendored, 3 changed lines)

@@ -3,6 +3,5 @@
 logs/
 done_tables.txt
 done_transfers.txt
-# CocoIndex Code (ccc)
-/.cocoindex_code/
 **/target
+*.log

Dockerfile

@@ -28,8 +28,9 @@ ENV PATH="/root/.cargo/bin:${PATH}" \
 WORKDIR /app
-COPY basedosdados.duckdb Caddyfile start.sh auth.py ask.py ./
-RUN chmod +x start.sh
+COPY data/basedosdados.duckdb shell/Caddyfile shell/auth.py start.sh ./
+COPY ask/ask /app/ask
+RUN chmod +x start.sh /app/ask
 EXPOSE 8080

README.md (127 changed lines)

@@ -8,11 +8,13 @@ Os dados foram exportados do BigQuery para o Hetzner Object Storage (Helsinki) n
 ## Consultando os dados
-Acesso via browser ou curl, protegido por senha. Peça a senha para o administrador.
+Acesso via browser ou curl, protegido por senha - peça!
 ### Shell no browser
-Acesse **https://db.xn--2dk.xyz** → autentique → shell DuckDB interativo direto no browser.
+Acesse **https://db..xyz** → autentique → shell DuckDB interativo direto no browser.
+Use `.tables` para listar os datasets.
 ### SQL via curl
@@ -46,35 +48,6 @@ curl -s -X POST https://db.xn--2dk.xyz/query \
   --data-binary @query.sql > resultado.csv
 ```
-### Descobrindo tabelas
-```sql
--- listar todos os datasets (schemas)
-SHOW SCHEMAS;
--- listar tabelas de um dataset
-SHOW TABLES IN br_anatel_banda_larga_fixa;
--- ver colunas de uma tabela
-DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
-```
-No shell do browser, `.tables` lista tudo de uma vez.
-### Exportar em CSV ou JSON
-O DuckDB permite formatar a saída diretamente na query:
-```sql
--- CSV com header (pipe para arquivo via curl)
-COPY (SELECT * FROM br_ibge_censo2022.municipios LIMIT 1000)
-TO '/dev/stdout' (FORMAT csv, HEADER true);
--- JSON
-SELECT * FROM br_ibge_censo2022.municipios LIMIT 10
-FORMAT JSON;
-```
 ---
 ## Exploração local
@@ -82,11 +55,11 @@ FORMAT JSON;
 Para rodar as queries na sua própria máquina com DuckDB instalado:
 ```bash
-python prepara_db.py    # gera basedosdados.duckdb com views apontando para o S3
-duckdb basedosdados.duckdb
+duckdb data/basedosdados.duckdb
 ```
 As queries são executadas diretamente sobre os arquivos Parquet no S3 — não há download de dados. O DuckDB lê os arquivos remotos sob demanda via `httpfs`.
+Precisa da credencial da .env - peça!
 ---
@@ -94,62 +67,52 @@ As queries são executadas diretamente sobre os arquivos Parquet no S3 — não
 Interface TUI que permite fazer perguntas em português e obter SQL automaticamente.
+### Arquitetura
+```
+Pergunta → [schema filtrado] → LLM local (sqlcoder) ou API externa
+         → SQL
+```
+1. **Schema filtrado**: As tabelas relevantes são filtradas e enviadas ao LLM
+2. **Geração SQL**: Modelo local (sqlcoder via Ollama) ou API externa (Gemini/OpenRouter)
 ### No browser
-Acesse **https://ask.xn--2dk.xyz** → autentique → digite sua pergunta em português.
+Acesse **https://ask..xyz** → autentique → digite sua pergunta em português.
 ### Local
 ```bash
+# Compilar
 cd ask
 cargo build --release
-./target/release/ask                                # modo interativo
-./target/release/ask "Quantos municípios tem SP?"   # modo CLI
+# Modo interativo (TUI)
+./target/release/ask
+# Modo CLI
+./target/release/ask "Quantos municípios tem SP?"
 ```
 ### Variáveis de ambiente
-| Variável | Descrição |
-|---|---|
-| `GEMINI_API_KEY` | Chave da API Gemini (obrigatória para usar modelos Gemini) |
-| `OPENROUTER_API_KEY` | Chave para usar modelos via OpenRouter |
-| `GEMINI_MODEL` | Modelo a usar (padrão: `gemini-flash-latest`) |
-| `SCHEMA_FILE` | Arquivo de schema (padrão: `context/schema_compact_inline.txt`) |
-| `DB_FILE` | Arquivo DuckDB (padrão: `basedosdados.duckdb`) |
+| Variável | Padrão | Descrição |
+|---|---|---|
+| `SQL_GENERATOR` | `gemini` | Generator: `sqlcoder`, `gemini`, ou `openrouter` |
+| `GEMINI_API_KEY` | — | Chave API Gemini (obrigatória se usar gemini) |
+| `OPENROUTER_API_KEY` | — | Chave API OpenRouter (obrigatória se usar openrouter) |
+| `GEMINI_MODEL` | `gemini-flash-latest` | Modelo Gemini |
+| `OPENROUTER_MODEL` | `openai/gpt-4o-mini` | Modelo OpenRouter |
+| `OLLAMA_MODEL` | `sqlcoder` | Modelo Ollama (sqlcoder ou sqlcoder:14b) |
+| `OLLAMA_HOST` | `http://localhost:11434` | Host Ollama |
+| `TOP_K_TABLES` | `5` | Número de tabelas a selecionar |
+| `SCHEMA_FILE` | `context/schema_compact_inline.txt` | Schema texto para fallback |
+| `SCHEMA_JSON` | `context/basedosdados-schema.json` | Schema JSON completo |
+| `DB_FILE` | `data/basedosdados.duckdb` | Arquivo DuckDB |
 ---
+## Arquivos de schema
+O diretório `context/` contém artefatos gerados automaticamente para contexto do LLM e descoberta de tabelas:
+| Arquivo | Descrição |
+|---|---|
+| `schema_compact_inline.txt` | Schema condensado para contexto do LLM |
+| `schema_compact.txt` | Schema mais verboso |
+| `schema_ddl.sql` | DDL das views DuckDB |
+| `join_graph.json` | Relacionamentos entre tabelas |
+| `file_tree.md` | Estrutura de arquivos no S3 com tamanhos |
+| `schemas.json` | Schema raw do BigQuery |
+---
+## Descobrindo tabelas
+```sql
+-- listar todos os datasets (schemas)
+SHOW SCHEMAS;
+-- listar tabelas de um dataset
+SHOW TABLES IN br_anatel_banda_larga_fixa;
+-- ver colunas de uma tabela
+DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
+```
+No shell do browser, `.tables` lista tudo de uma vez. Para descoberta programática, use os arquivos em `context/`.
+---
 ## Pipeline de exportação
@@ -172,8 +135,8 @@ Resume automático: se interrompido, basta rodar novamente.
 | Script | Função |
 |---|---|
-| `roda.sh` | Pipeline principal de exportação |
-| `prepara_db.py` | Gera `basedosdados.duckdb` com views para todas as tabelas |
+| `scripts/roda.sh` | Pipeline principal de exportação |
+| `scripts/prepara_db.py` | Gera `data/basedosdados.duckdb` com views para todas as tabelas |
 ### Configuração (`.env`)
@@ -196,10 +159,10 @@ Resume automático: se interrompido, basta rodar novamente.
 ### Executando
 ```bash
-chmod +x roda.sh
-./roda.sh --dry-run      # estima tamanho e custo
-./roda.sh                # execução local
-./roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
+chmod +x scripts/roda.sh
+./scripts/roda.sh --dry-run      # estima tamanho e custo
+./scripts/roda.sh                # execução local
+./scripts/roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
 ```
 Autenticação GCP necessária antes da primeira exportação:
@@ -219,8 +182,8 @@ Cria uma VM `e2-standard-4` Debian 12 em `us-central1-a`, copia o script e o `.e
 | `GCP_VM_NAME` | `bd-export-vm` | Nome da instância |
 | `GCP_VM_ZONE` | `us-central1-a` | Zona do Compute Engine |
-### Deploy do servidor
+### Deploy do servidor para serviços de db e ask
 ```bash
-haloy deploy
+haloy deploy -f shell/haloy.yml
 ```

ask.py (deleted, 129 lines)

@@ -1,129 +0,0 @@
#!/usr/bin/env python3
"""
ask.py — Send a Portuguese question to Gemini and get back SQL.
Usage:
python ask.py "Quantos pedidos foram feitos por cliente no último mês?"
python ask.py "Qual a taxa de mortalidade infantil por município em 2020?"
Env vars:
GEMINI_API_KEY — required
SCHEMA_FILE — path to DDL file (default: context/schema_compact_inline.txt)
GEMINI_MODEL — model slug (default: gemini-2.0-flash-latest)
"""
import os
import sys
import json
import requests
import duckdb
from dotenv import load_dotenv
load_dotenv()
SCHEMA_FILE = os.getenv("SCHEMA_FILE", "context/schema_compact_inline.txt")
MODEL = os.getenv("GEMINI_MODEL", "gemini-flash-latest")
DB_FILE = os.getenv("DB_FILE", "basedosdados.duckdb")
def load_schema(path: str) -> str:
with open(path, "r", encoding="utf-8") as f:
return f.read()
def ask(question: str) -> str:
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
sys.exit("Error: GEMINI_API_KEY not set")
schema_ddl = load_schema(SCHEMA_FILE)
system_prompt = (
"You are a SQL expert for Base dos Dados (basedosdados.org), "
"a Brazilian open data warehouse with tables accessed via DuckDB.\n\n"
"Rules:\n"
"- Use DuckDB syntax. Tables are referenced as dataset.table.\n"
"- Only use columns from the provided DDL — never invent column names.\n"
"- Add WHERE filters on ano, sigla_uf, or id_municipio whenever possible.\n"
"- Return ONLY the SQL query, no explanation, no markdown fences.\n\n"
f"Schema DDL:\n\n{schema_ddl}"
)
url = (
f"https://generativelanguage.googleapis.com/v1beta/models"
f"/{MODEL}:generateContent"
)
payload = {
"system_instruction": {
"parts": [{"text": system_prompt}]
},
"contents": [
{
"parts": [{"text": question}]
}
]
}
response = requests.post(
url,
headers={
"Content-Type": "application/json",
"X-goog-api-key": api_key,
},
data=json.dumps(payload),
timeout=300,
)
response.raise_for_status()
result = response.json()
return result["candidates"][0]["content"]["parts"][0]["text"].strip()
def main():
if len(sys.argv) < 2:
print(f"Usage: python {sys.argv[0]} \"<pergunta em português>\"", file=sys.stderr)
sys.exit(1)
question = " ".join(sys.argv[1:])
print(f"Question: {question}\n", file=sys.stderr)
print(f"Model: {MODEL}\n", file=sys.stderr)
sql = ask(question)
print(f"\n── SQL ──────────────────────────────────────────\n{sql}\n", file=sys.stderr)
con = duckdb.connect(DB_FILE, read_only=True)
rel = con.sql(sql)
# box mode: build borders from column names + data
cols = rel.columns
rows = rel.fetchall()
if not rows:
print("(no rows returned)")
return
col_widths = [len(c) for c in cols]
for row in rows:
for i, val in enumerate(row):
col_widths[i] = max(col_widths[i], len(str(val) if val is not None else "NULL"))
def bar(left, mid, right, fill=""):
return left + mid.join(fill * (w + 2) for w in col_widths) + right
header = "" + "".join(f" {c:{w}} " for c, w in zip(cols, col_widths)) + ""
print(bar("", "", ""))
print(header)
print(bar("", "", ""))
for row in rows:
vals = [str(v) if v is not None else "NULL" for v in row]
print("" + "".join(f" {v:{w}} " for v, w in zip(vals, col_widths)) + "")
print(bar("", "", ""))
print(f"\n{len(rows)} row(s)")
if __name__ == "__main__":
main()

ask/.dockerignore (new file, 1 line)

@@ -0,0 +1 @@
target

ask/Cargo.lock (generated, 1 changed line)

@@ -252,6 +252,7 @@ dependencies = [
  "duckdb",
  "ratatui",
  "reqwest",
+ "serde",
  "serde_json",
  "syntect",
  "tui-textarea",

ask/Cargo.toml

@@ -9,6 +9,7 @@ path = "src/main.rs"
 [dependencies]
 reqwest = { version = "0.12", features = ["blocking", "rustls-tls", "json"], default-features = false }
+serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 duckdb = { version = "1", features = ["bundled"] }
 dotenvy = "0.15"

ask/ask (new executable; binary file not shown)

ask/src/main.rs

@@ -1,4 +1,9 @@
+mod schema_filter;
+mod sql_generator;
+mod table_selector;
+
 use anyhow::{Context, Result};
+use chrono::Utc;
 use crossterm::{
     event::{
         DisableBracketedPaste, DisableMouseCapture, EnableBracketedPaste, EnableMouseCapture,
@@ -9,14 +14,12 @@ use crossterm::{
 };
 use duckdb::Connection;
 use ratatui::{
-    buffer::Buffer,
     layout::{Constraint, Direction, Layout, Rect},
     style::{Color, Modifier, Style},
     text::{Line, Span},
     widgets::{Block, Borders, Gauge, Paragraph, Row, Table, TableState, Wrap},
     Frame, Terminal,
 };
-use chrono::Utc;
 use serde_json::{json, Value};
 use std::{
     env, fs,
@@ -43,6 +46,10 @@ struct Config {
     schema: String,
     db_file: String,
     prompt_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 }

 enum Phase {
@@ -234,10 +241,23 @@ fn spawn_worker(
     model: String,
     prompt_file: String,
     db_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 ) -> mpsc::Receiver<WorkerMsg> {
     let (tx, rx) = mpsc::channel::<WorkerMsg>();
-    std::thread::spawn(
-        move || match ask_model(&question, &schema, &model, &prompt_file) {
+    std::thread::spawn(move || {
+        match ask_model_with_selection(
+            &question,
+            &schema,
+            &model,
+            &prompt_file,
+            use_table_selection,
+            &embeddings_file,
+            &schema_json,
+            similarity_threshold,
+        ) {
             Err(e) => {
                 let err = format!("{:#}", e);
                 log_question(&question, "", false, Some(&err));
@@ -257,8 +277,8 @@ fn spawn_worker(
                 }
             }
         }
-        },
-    );
+        }
+    });
     rx
 }
@@ -270,6 +290,10 @@ fn spawn_retry_worker(
     model: String,
     prompt_file: String,
     db_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 ) -> mpsc::Receiver<WorkerMsg> {
     let retry_q = format!(
         "{}\n\nO SQL que você gerou falhou com este erro DuckDB:\n```\n{}\n```\n\n\
@@ -277,7 +301,17 @@ fn spawn_retry_worker(
         Corrija o SQL. Retorne APENAS o SQL corrigido, sem explicação.",
         question, error, failed_sql
     );
-    spawn_worker(retry_q, schema, model, prompt_file, db_file)
+    spawn_worker(
+        retry_q,
+        schema,
+        model,
+        prompt_file,
+        db_file,
+        use_table_selection,
+        embeddings_file,
+        schema_json,
+        similarity_threshold,
+    )
 }

 // ── event handling ────────────────────────────────────────────────────────────
@@ -327,6 +361,10 @@ impl App {
             self.config.model.clone(),
             self.config.prompt_file.clone(),
             self.config.db_file.clone(),
+            self.config.use_table_selection,
+            self.config.embeddings_file.clone(),
+            self.config.schema_json.clone(),
+            self.config.similarity_threshold,
         ));
     }
@@ -398,6 +436,10 @@ impl App {
             self.config.model.clone(),
             self.config.prompt_file.clone(),
             self.config.db_file.clone(),
+            self.config.use_table_selection,
+            self.config.embeddings_file.clone(),
+            self.config.schema_json.clone(),
+            self.config.similarity_threshold,
         ));
         self.last_sql.clear();
     } else {
@@ -723,7 +765,12 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let col_max_widths: Vec<usize> = (0..col_count) let col_max_widths: Vec<usize> = (0..col_count)
.map(|i| { .map(|i| {
let header_len = cols[i].len(); let header_len = cols[i].len();
let data_len = rows.iter().filter_map(|r| r.get(i)).map(|c| c.len()).max().unwrap_or(0); let data_len = rows
.iter()
.filter_map(|r| r.get(i))
.map(|c| c.len())
.max()
.unwrap_or(0);
(header_len.max(data_len)).max(min_col_width as usize) (header_len.max(data_len)).max(min_col_width as usize)
}) })
.collect(); .collect();
@@ -732,16 +779,24 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let use_wrap = total_needed > available_width as usize; let use_wrap = total_needed > available_width as usize;
if use_wrap { if use_wrap {
let wrap_width = (available_width as usize / col_count).max(min_col_width as usize); let wrap_width =
let header_lines: Vec<Line> = cols.iter() (available_width as usize / col_count).max(min_col_width as usize);
let header_lines: Vec<Line> = cols
.iter()
.enumerate() .enumerate()
.map(|(i, c)| { .map(|(i, c)| {
let wrapped = wrap_text(c, wrap_width); let wrapped = wrap_text(c, wrap_width);
Line::from(wrapped) let spans: Vec<Span> =
wrapped.into_iter().map(|s| Span::raw(s)).collect();
Line::from(spans)
}) })
.collect(); .collect();
let max_header_lines = header_lines.iter().map(|l| l.len()).max().unwrap_or(1); let max_header_lines = header_lines
.iter()
.map(|l| l.spans.len())
.max()
.unwrap_or(1);
let mut all_row_lines: Vec<Vec<Line>> = Vec::new(); let mut all_row_lines: Vec<Vec<Line>> = Vec::new();
for row in rows { for row in rows {
@@ -749,19 +804,19 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
.map(|i| { .map(|i| {
let cell = row.get(i).map(|s| s.as_str()).unwrap_or(""); let cell = row.get(i).map(|s| s.as_str()).unwrap_or("");
let wrapped = wrap_text(cell, wrap_width); let wrapped = wrap_text(cell, wrap_width);
Line::from(wrapped) let spans: Vec<Span> =
wrapped.into_iter().map(|s| Span::raw(s)).collect();
Line::from(spans)
}) })
.collect(); .collect();
let max_lines = row_lines.iter().map(|l| l.len()).max().unwrap_or(1); let max_lines = row_lines.iter().map(|l| l.spans.len()).max().unwrap_or(1);
all_row_lines.push(row_lines); all_row_lines.push(row_lines);
} }
let selected_idx = table_state.selected().unwrap_or(0); let selected_idx = table_state.selected().unwrap_or(0);
let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n); let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n);
let block = Block::default() let block = Block::default().borders(Borders::ALL).title(table_title);
.borders(Borders::ALL)
.title(table_title);
let area = chunks[2]; let area = chunks[2];
f.render_widget(block, area); f.render_widget(block, area);
@@ -778,29 +833,32 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let start_row = if n > visible_rows as usize { let start_row = if n > visible_rows as usize {
let scroll = selected_idx as i32 - visible_rows as i32 / 2; let scroll = selected_idx as i32 - visible_rows as i32 / 2;
scroll.max(0) as usize.min(n.saturating_sub(visible_rows as usize)) (scroll.max(0) as usize).min(n.saturating_sub(visible_rows as usize))
} else { } else {
0 0
}; };
let header_bg = Style::default().fg(Color::Yellow).add_modifier(Modifier::BOLD); let header_bg = Style::default()
.fg(Color::Yellow)
.add_modifier(Modifier::BOLD);
for (col_idx, header_line) in header_lines.iter().enumerate() { for (col_idx, header_line) in header_lines.iter().enumerate() {
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1); let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
let col_width = wrap_width as u16; let col_width = wrap_width as u16;
for (line_idx, line) in header_line.iter().enumerate() { for (line_idx, span) in header_line.spans.iter().enumerate() {
let y = inner_area.y + line_idx as u16; let y = inner_area.y + line_idx as u16;
if y >= inner_area.y + inner_area.height { if y >= inner_area.y + inner_area.height {
break; break;
} }
let spans: Vec<Span> = line.spans.iter().map(|s| { let styled_span = Span::styled(span.content.clone(), header_bg);
Span::styled(s.content.clone(), header_bg) f.render_widget(
}).collect(); Paragraph::new(Line::from(styled_span)),
f.render_widget(Paragraph::new(Line::from(spans)), Rect { Rect {
x: col_x, x: col_x,
y, y,
width: col_width, width: col_width,
height: 1, height: 1,
}); },
);
} }
} }
@@ -811,7 +869,9 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
} }
let is_selected = row_idx == selected_idx; let is_selected = row_idx == selected_idx;
let row_style = if is_selected { let row_style = if is_selected {
Style::default().bg(Color::DarkGray).add_modifier(Modifier::BOLD) Style::default()
.bg(Color::DarkGray)
.add_modifier(Modifier::BOLD)
} else { } else {
Style::default() Style::default()
}; };
@@ -820,20 +880,21 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
for (col_idx, cell_lines) in row_lines.iter().enumerate() { for (col_idx, cell_lines) in row_lines.iter().enumerate() {
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1); let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
let col_width = wrap_width as u16; let col_width = wrap_width as u16;
for (line_idx, line) in cell_lines.iter().enumerate() { for (line_idx, span) in cell_lines.spans.iter().enumerate() {
let cell_y = y + line_idx as u16; let cell_y = y + line_idx as u16;
if cell_y >= inner_area.y + inner_area.height { if cell_y >= inner_area.y + inner_area.height {
break; break;
} }
let spans: Vec<Span> = line.spans.iter().map(|s| { let styled_span = Span::styled(span.content.clone(), row_style);
Span::styled(s.content.clone(), row_style) f.render_widget(
}).collect(); Paragraph::new(Line::from(styled_span)),
f.render_widget(Paragraph::new(Line::from(spans)), Rect { Rect {
x: col_x, x: col_x,
y: cell_y, y: cell_y,
width: col_width, width: col_width,
height: 1, height: 1,
}); },
);
} }
} }
@@ -850,7 +911,8 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
} }
} }
} else { } else {
let col_widths: Vec<Constraint> = cols.iter() let col_widths: Vec<Constraint> = cols
.iter()
.enumerate() .enumerate()
.map(|(i, _)| { .map(|(i, _)| {
let w = col_max_widths[i] as u16; let w = col_max_widths[i] as u16;
@@ -1008,6 +1070,55 @@ fn ask_model(question: &str, schema: &str, model: &str, prompt_file: &str) -> Re
    Ok(ensure_sql(&sql))
}
fn ask_model_with_selection(
question: &str,
_full_schema: &str,
model: &str,
prompt_file: &str,
use_selection: bool,
embeddings_file: &str,
schema_json: &str,
similarity_threshold: f32,
) -> Result<String> {
let prompt_template = fs::read_to_string(prompt_file)
.with_context(|| format!("Não foi possível ler o prompt: {}", prompt_file))?;
let (schema_to_use, selected_tables) = if use_selection {
match table_selector::select_tables_from_question(
question,
embeddings_file,
similarity_threshold,
) {
Ok(table_ids) => {
eprintln!(
"=> Selecionadas {} tables relevantes: {:?}",
table_ids.len(),
table_ids
);
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
let filtered_schema = schema_filter.filter_tables(&table_ids);
(filtered_schema, Some(table_ids))
}
Err(e) => {
eprintln!(
"=> Aviso: falha na seleção de tables ({}), usando schema completo",
e
);
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
(schema_filter.full_schema_text(), None)
}
}
} else {
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
(schema_filter.full_schema_text(), None)
};
let generator = sql_generator::create_sql_generator()?;
let sql = generator.generate(question, &schema_to_use, &prompt_template)?;
Ok(ensure_sql(&sql))
}
fn ask_gemini(question: &str, system_prompt: &str, model: &str) -> Result<String> {
    let key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY não definida")?;
    let url = format!(
@@ -1309,6 +1420,12 @@ VARIÁVEIS DE AMBIENTE
 OPENROUTER_API_KEY   necessária para modelos OpenRouter
 GEMINI_MODEL         modelo padrão (sobrescrito por --model)
 SCHEMA_FILE          DDL do schema [context/schema_compact_inline.txt]
+SCHEMA_JSON          full schema JSON [context/basedosdados-schema.json]
+EMBEDDINGS_FILE      table embeddings [context/table_embeddings.json]
+TOP_K_TABLES         número de tables a selecionar [5]
+SQL_GENERATOR        sql generator: sqlcoder|gemini|openrouter [gemini]
+OLLAMA_MODEL         modelo ollama [sqlcoder]
+OLLAMA_HOST          host ollama [http://localhost:11434]
 PROMPT_FILE          prompt do sistema [ask/system_prompt.md]
 DB_FILE              banco DuckDB [basedosdados.duckdb]
 "#
@@ -1321,7 +1438,18 @@ VARIÁVEIS DE AMBIENTE
     });
     let schema_file =
         env::var("SCHEMA_FILE").unwrap_or_else(|_| "context/schema_compact_inline.txt".into());
-    let db_file = env::var("DB_FILE").unwrap_or_else(|_| "basedosdados.duckdb".into());
+    let schema_json =
+        env::var("SCHEMA_JSON").unwrap_or_else(|_| "context/basedosdados-schema.json".into());
+    let embeddings_file =
+        env::var("EMBEDDINGS_FILE").unwrap_or_else(|_| "context/table_embeddings.json".into());
+    let similarity_threshold = env::var("SIMILARITY_THRESHOLD")
+        .ok()
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(0.35);
+    let use_table_selection = env::var("USE_TABLE_SELECTION")
+        .map(|v| v != "false" && v != "0")
+        .unwrap_or(true);
+    let db_file = env::var("DB_FILE").unwrap_or_else(|_| "data/basedosdados.duckdb".into());
     let prompt_file = env::var("PROMPT_FILE").unwrap_or_else(|_| "ask/system_prompt.md".into());
     let schema = fs::read_to_string(&schema_file)
         .with_context(|| format!("Não foi possível ler o schema: {}", schema_file))?;
@@ -1333,6 +1461,10 @@ VARIÁVEIS DE AMBIENTE
         schema,
         db_file,
         prompt_file,
+        use_table_selection,
+        embeddings_file,
+        schema_json,
+        similarity_threshold,
     });
 }
@@ -1341,7 +1473,16 @@ VARIÁVEIS DE AMBIENTE
     eprintln!("\nModel: {}\nPergunta: {}\n", model, question);
     let t0 = Instant::now();
-    let sql = ask_model(&question, &schema, &model, &prompt_file)?;
+    let sql = ask_model_with_selection(
+        &question,
+        &schema,
+        &model,
+        &prompt_file,
+        use_table_selection,
+        &embeddings_file,
+        &schema_json,
+        similarity_threshold,
+    )?;
     eprintln!("=> SQL gerado em {}", fmt_duration(t0.elapsed()));
     print_sql_box(&sql);

ask/src/schema_filter.rs (new file, 135 lines)

@@ -0,0 +1,135 @@
use serde::{Deserialize, Serialize};
use std::collections::HashSet;
use std::fs;
use std::path::Path;
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct Column {
pub name: String,
#[serde(rename = "type")]
pub col_type: String,
pub description: Option<String>,
}
pub type TableColumns = Vec<Column>;
#[derive(Debug, Clone, Deserialize)]
pub struct FullSchema {
#[serde(flatten)]
pub datasets:
std::collections::HashMap<String, std::collections::HashMap<String, TableColumns>>,
}
pub struct SchemaFilter {
schema: FullSchema,
}
impl SchemaFilter {
pub fn new<P: AsRef<Path>>(schema_path: P) -> anyhow::Result<Self> {
let content = fs::read_to_string(schema_path)?;
let schema: FullSchema = serde_json::from_str(&content)?;
Ok(Self { schema })
}
pub fn filter_tables(&self, table_ids: &[String]) -> String {
let selected: HashSet<String> = table_ids.iter().cloned().collect();
let mut lines = Vec::new();
lines.push("# Base dos Dados — Filtered Schema".to_string());
lines.push(
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
.to_string(),
);
lines.push("# Format: dataset.table: col:TYPE description".to_string());
lines.push(String::new());
for (dataset, tables) in &self.schema.datasets {
for (table, columns) in tables {
let full_id = format!("{}.{}", dataset, table);
if selected.contains(&full_id) {
let col_str = columns
.iter()
.map(|c| {
let desc = c.description.as_deref().unwrap_or("");
if desc.is_empty() {
format!("{}:{}", c.name, type_abbrev(&c.col_type))
} else {
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
}
})
.collect::<Vec<_>>()
.join(" ");
lines.push(format!("{}: {}", full_id, col_str));
}
}
}
lines.join("\n")
}
pub fn full_schema_text(&self) -> String {
let mut lines = Vec::new();
lines.push("# Base dos Dados — Full Schema".to_string());
lines.push(
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
.to_string(),
);
lines.push("# Format: dataset.table: col:TYPE description".to_string());
lines.push(String::new());
for (dataset, tables) in &self.schema.datasets {
for (table, columns) in tables {
let full_id = format!("{}.{}", dataset, table);
let col_str = columns
.iter()
.map(|c| {
let desc = c.description.as_deref().unwrap_or("");
if desc.is_empty() {
format!("{}:{}", c.name, type_abbrev(&c.col_type))
} else {
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
}
})
.collect::<Vec<_>>()
.join(" ");
lines.push(format!("{}: {}", full_id, col_str));
}
}
lines.join("\n")
}
pub fn dataset_count(&self) -> usize {
self.schema.datasets.len()
}
pub fn table_count(&self) -> usize {
self.schema.datasets.values().map(|t| t.len()).sum()
}
}
fn type_abbrev(full_type: &str) -> String {
let upper = full_type.to_uppercase();
if upper.contains("VARCHAR") || upper.contains("STRING") {
"V".to_string()
} else if upper.contains("INT") {
"I".to_string()
} else if upper.contains("DOUBLE") || upper.contains("FLOAT") {
"D".to_string()
} else if upper.contains("DATE") && !upper.contains("TIMESTAMP") {
"Dt".to_string()
} else if upper.contains("TIMESTAMP") {
"Ts".to_string()
} else if upper.contains("TIME") {
"Ti".to_string()
} else if upper.contains("BOOLEAN") {
"B".to_string()
} else if upper.contains("DECIMAL") {
"Dec".to_string()
} else {
full_type.to_string()
}
}

ask/src/sql_generator.rs (new file, 207 lines)

@@ -0,0 +1,207 @@
use anyhow::{Context, Result};
use serde_json::Value;
use std::env;
pub trait SqlGenerator: Send + Sync {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String>;
}
pub fn create_sql_generator() -> Result<Box<dyn SqlGenerator>> {
let generator_type = env::var("SQL_GENERATOR").unwrap_or_else(|_| "gemini".to_string());
match generator_type.as_str() {
"sqlcoder" => Ok(Box::new(SqlCoderGenerator::new()?)),
"openrouter" => Ok(Box::new(OpenRouterGenerator::new()?)),
"gemini" => Ok(Box::new(GeminiGenerator::new()?)),
_ => anyhow::bail!(
"Unknown SQL_GENERATOR: {}. Use: sqlcoder, gemini, or openrouter",
generator_type
),
}
}
pub struct GeminiGenerator {
model: String,
api_key: String,
}
impl GeminiGenerator {
pub fn new() -> Result<Self> {
let model = env::var("GEMINI_MODEL").unwrap_or_else(|_| "gemini-flash-latest".to_string());
let api_key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY not defined")?;
Ok(Self { model, api_key })
}
}
impl SqlGenerator for GeminiGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = format!(
"https://generativelanguage.googleapis.com/v1beta/models/{}:generateContent",
self.model
);
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
let payload = serde_json::json!({
"system_instruction": { "parts": [{ "text": system_prompt }] },
"contents": [{ "parts": [{ "text": question }] }]
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(&url)
.header("Content-Type", "application/json")
.header("X-goog-api-key", &self.api_key)
.json(&payload)
.send()
.context("Gemini HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse Gemini response")?;
if !status.is_success() {
anyhow::bail!("Gemini API error {}: {}", status, body);
}
let text = body["candidates"][0]["content"]["parts"][0]["text"]
.as_str()
.context("Unexpected Gemini response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
pub struct OpenRouterGenerator {
model: String,
api_key: String,
}
impl OpenRouterGenerator {
pub fn new() -> Result<Self> {
let model =
env::var("OPENROUTER_MODEL").unwrap_or_else(|_| "openai/gpt-4o-mini".to_string());
let api_key = env::var("OPENROUTER_API_KEY").context("OPENROUTER_API_KEY not defined")?;
Ok(Self { model, api_key })
}
}
impl SqlGenerator for OpenRouterGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = "https://openrouter.ai/api/v1/chat/completions";
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
let payload = serde_json::json!({
"model": self.model,
"messages": [
{ "role": "system", "content": system_prompt },
{ "role": "user", "content": question }
]
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(url)
.header("Content-Type", "application/json")
.header("Authorization", format!("Bearer {}", self.api_key))
.header("HTTP-Referer", "https://basedosdados.org")
.header("X-Title", "Base dos Dados Ask")
.json(&payload)
.send()
.context("OpenRouter HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse OpenRouter response")?;
if !status.is_success() {
anyhow::bail!("OpenRouter API error {}: {}", status, body);
}
let text = body["choices"][0]["message"]["content"]
.as_str()
.context("Unexpected OpenRouter response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
pub struct SqlCoderGenerator {
model: String,
host: String,
}
impl SqlCoderGenerator {
pub fn new() -> Result<Self> {
let model = env::var("OLLAMA_MODEL").unwrap_or_else(|_| "sqlcoder".to_string());
let host = env::var("OLLAMA_HOST").unwrap_or_else(|_| "http://localhost:11434".to_string());
Ok(Self { model, host })
}
}
impl SqlGenerator for SqlCoderGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = format!("{}/api/generate", self.host);
let full_prompt = format!(
"{}\n\nSchema DDL:\n\n{}\n\nQuestion: {}\n\nSQL:",
prompt_template.trim(),
schema,
question
);
let payload = serde_json::json!({
"model": self.model,
"prompt": full_prompt,
"stream": false
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(&url)
.header("Content-Type", "application/json")
.json(&payload)
.send()
.context("Ollama HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse Ollama response")?;
if !status.is_success() {
anyhow::bail!("Ollama API error {}: {}", status, body);
}
let text = body["response"]
.as_str()
.context("Unexpected Ollama response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
fn strip_fences(text: &str) -> String {
    let text = text.trim();
    if let Some(rest) = text.strip_prefix("```sql") {
        // search for the closing fence only after the opening one
        let end = rest.find("```").unwrap_or(rest.len());
        rest[..end].trim().to_string()
    } else if let Some(rest) = text.strip_prefix("```") {
        let end = rest.find("```").unwrap_or(rest.len());
        rest[..end].trim().to_string()
    } else {
        text.to_string()
    }
}

ask/src/table_selector.rs (new file, 146 lines)

@@ -0,0 +1,146 @@
use serde::{Deserialize, Serialize};
use std::fs;
use std::path::Path;
const DEFAULT_SIMILARITY_THRESHOLD: f32 = 0.35;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TableEmbedding {
pub id: String,
pub text: String,
pub embedding: Vec<f32>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbeddingsIndex {
pub tables: Vec<TableEmbedding>,
pub model: String,
}
pub struct TableSelector {
tables: Vec<TableEmbedding>,
threshold: f32,
}
impl TableSelector {
pub fn new<P: AsRef<Path>>(embeddings_path: P, threshold: f32) -> anyhow::Result<Self> {
let content = fs::read_to_string(embeddings_path)?;
let index: EmbeddingsIndex = serde_json::from_str(&content)?;
Ok(Self {
tables: index.tables,
threshold,
})
}
pub fn select_tables(
&self,
question: &str,
model: &dyn QuestionEmbedder,
) -> anyhow::Result<Vec<String>> {
let question_embedding = model.embed(question)?;
let mut similarities: Vec<(usize, f32)> = self
.tables
.iter()
.enumerate()
.map(|(i, table)| {
let sim = cosine_similarity(&question_embedding, &table.embedding);
(i, sim)
})
.collect();
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
let selected: Vec<String> = similarities
.into_iter()
.filter(|(_, sim)| *sim >= self.threshold)
.map(|(i, sim)| {
eprintln!(" {} (similarity: {:.3})", self.tables[i].id, sim);
self.tables[i].id.clone()
})
.collect();
Ok(selected)
}
pub fn get_table_texts(&self, table_ids: &[String]) -> Vec<String> {
table_ids
.iter()
.filter_map(|id| self.tables.iter().find(|t| &t.id == id))
.map(|t| t.text.clone())
.collect()
}
pub fn table_count(&self) -> usize {
self.tables.len()
}
}
pub trait QuestionEmbedder: Send + Sync {
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>>;
}
pub struct LocalEmbedder {
model_path: String,
}
impl LocalEmbedder {
pub fn new(model_path: String) -> Self {
Self { model_path }
}
}
impl QuestionEmbedder for LocalEmbedder {
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>> {
use std::process::Command;
let output = Command::new("python3")
.args([
"-c",
&format!(
r#"
import json
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('{}')
emb = model.encode('{}', convert_to_numpy=True)
print(json.dumps([float(x) for x in emb]))
"#,
self.model_path,
text.replace("'", "\\'")
),
])
.output()?;
if !output.status.success() {
let err = String::from_utf8_lossy(&output.stderr);
anyhow::bail!("Embedding generation failed: {}", err);
}
let output_str = String::from_utf8_lossy(&output.stdout);
let floats: Vec<f32> = serde_json::from_str(&output_str)?;
Ok(floats)
}
}
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
if norm_a == 0.0 || norm_b == 0.0 {
0.0
} else {
dot_product / (norm_a * norm_b)
}
}
pub fn select_tables_from_question(
question: &str,
embeddings_path: &str,
threshold: f32,
) -> anyhow::Result<Vec<String>> {
let selector = TableSelector::new(embeddings_path, threshold)?;
let embedder = LocalEmbedder::new("all-MiniLM-L6-v2".to_string());
selector.select_tables(question, &embedder)
}

ask/system_prompt.md

@@ -147,3 +147,68 @@ LIMIT 30
if the question requires tables not in the provided DDL, OR
If you cant generate a valid SQL,
answer as a JSON {error: "#{reason}"}
## Common SQL Pitfalls & Debugging Strategy
### 1. Column Propagation in CTEs (Most Common Error!)
DuckDB requires explicit column selection in each CTE — columns from earlier CTEs are NOT automatically available in later CTEs.
WRONG — `pop_2010` was not selected in `populacao` CTE:
```sql
WITH populacao AS (
SELECT id_municipio, sigla_uf -- forgot pop_2010
),
fluxo AS (
SELECT p.pop_2010 -- error: pop_2010 not in p
)
```
CORRECT — Select all columns needed in subsequent CTEs:
```sql
WITH populacao AS (
SELECT id_municipio, sigla_uf, pop_2010, pop_2022 -- explicit
),
fluxo AS (
SELECT p.pop_2010 -- works
)
```
### 2. ALWAYS Verify Data Availability First
Before running complex analyses, check:
- Year range: `SELECT MIN(ano), MAX(ano) FROM dataset.table`
- Record count: `SELECT COUNT(*) FROM dataset.table`
- ID format compatibility between tables before JOIN
### 3. Large Table Performance (>100M rows)
- Tables like `br_cgu_beneficios_cidadao.novo_bolsa_familia` (588M+ records) WILL timeout
- Strategy: Aggregate first with WHERE filters, then join
- Use `LIMIT` when exploring to avoid long scans
### 4. Lock Conflicts
Multiple concurrent DuckDB queries on the same `.duckdb` file cause lock errors.
- Wait between queries or use read-only mode
### 5. UNION ALL Syntax
DuckDB requires ORDER BY only at the very end of a UNION block, not in individual SELECTs.
WRONG:
```sql
SELECT ... LIMIT 5
ORDER BY x
UNION ALL
SELECT ... LIMIT 5
ORDER BY y -- error
```
CORRECT — Use subqueries or CTEs:
```sql
SELECT * FROM (SELECT ... ORDER BY x LIMIT 5) a
UNION ALL
SELECT * FROM (SELECT ... ORDER BY y LIMIT 5) b
```
### 6. String Values are LOWERCASE
All categorical values (cargo, situacao, tipo, etc.) are stored in lowercase.
Always use: `WHERE cargo = 'deputado federal'` not `'DEPUTADO FEDERAL'`

Binary file not shown.

File diff suppressed because one or more lines are too long

context/table_embeddings.json (new file, 298355 lines; diff suppressed: file too long)

data/basedosdados.duckdb (new binary file; not shown)

docs/dataset_embeds.md (new file, 59 lines)

@@ -0,0 +1,59 @@
## Goal
Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.
## Instructions
- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
- Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs
## Discoveries
- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
- **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
- **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL
- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
- **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)
## Accomplished
1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/
## Relevant files / directories
### Created/Modified
- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names
### Configuration (env vars)
- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
- `SIMILARITY_THRESHOLD` - 0.35 default
- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
- `EMBEDDINGS_FILE`, `SCHEMA_JSON`
## Next Steps
- Increase similarity threshold (try 0.45) to reduce table count
- Improve sqlcoder prompt for better SQL generation
- Add fallback to increase threshold if too many tables selected
- Consider keyword matching as backup if embeddings fail

docs/patterns-audit.md (new file, 299 lines)

@@ -0,0 +1,299 @@
# Pattern Audit — Robustness & False Positive Analysis
Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.
---
## US1 — Split Contracts Below Threshold (`split_contracts_below_threshold`)
### Legal basis
**Fracionamento de licitação** is prohibited by:
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.
### Threshold: year-dependent
| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |
For 2023 data, many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$ 57.912. Using a static R$ 17.600 for 2024+ data would miss the main fraud window (R$17k to R$57k per contract). **Fixed (iteration 7):** all three implementations compute the threshold from the query year.
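A minimal sketch of the check as described above (BigQuery-style SQL; the supplier column `cnpj_fornecedor` and the bare table name are placeholders, not the exact queries in `index.ts` / `scan-*.ts`):

```sql
-- Per supplier, ministry and month: >= 3 contracts, each below the year-dependent
-- threshold, with a combined value above it.
WITH contratos AS (
  SELECT
    cnpj_fornecedor,
    id_orgao_superior,
    FORMAT_DATE('%Y-%m', data_assinatura_contrato) AS mes,
    valor_inicial_compra,
    IF(EXTRACT(YEAR FROM data_assinatura_contrato) >= 2024, 57912, 17600) AS threshold
  FROM contrato_compra
  WHERE valor_inicial_compra > 0
    AND data_assinatura_contrato IS NOT NULL   -- avoids the spurious mes = NULL bucket
)
SELECT cnpj_fornecedor, id_orgao_superior, mes,
       COUNT(*) AS n_contratos,
       SUM(valor_inicial_compra) AS combined_value
FROM contratos
WHERE valor_inicial_compra < threshold         -- each contract individually below the limit
GROUP BY cnpj_fornecedor, id_orgao_superior, mes, threshold
HAVING COUNT(*) >= 3 AND SUM(valor_inicial_compra) > threshold;
```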
### False positive scenarios
1. **Legitimate multi-item purchasing**: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The `combined_value > threshold` guard reduces but doesn't eliminate this.
2. **Recurring service contracts**: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year — correctly NOT flagged (combined = R$18.000 > threshold, count ≥ 3 in first 3 months).
3. **Different sub-units**: The grouping uses `id_orgao_superior` (ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.
### Improvements applied
- None structural. Filter `valor_inicial_compra > 0` prevents division issues.
### Known data quality issues
- `data_assinatura_contrato` can be NULL for some older contracts. **`FORMAT_DATE` on NULL returns NULL — it does NOT exclude those rows.** Without a guard, all NULL-dated contracts from the same agency would be grouped together under a single `NULL` month bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now include `AND data_assinatura_contrato IS NOT NULL` in the WHERE clause.
- `valor_inicial_compra` vs `valor_final_compra`: we use `valor_inicial_compra` intentionally since splitting is defined by the contract as signed, not final.
### Improvements applied (iteration 5)
- Added `AND data_assinatura_contrato IS NOT NULL` to WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spurious `mes = NULL` bucket.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): `scan-all.ts` now includes `id_orgao_superior` in both SELECT and GROUP BY, matching `index.ts` and `scan-suspicious.ts`. Prevents theoretical merging of two distinct ministries sharing the same name.
---
## US2 — Contract Concentration (`contract_concentration`)
### Legal basis
No specific legal prohibition, but **TCU** and **CGU** audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.
### Thresholds
- **40% share**: empirical; above this, competition is functionally absent for that agency.
- **R$ 50.000 minimum agency total**: excludes micro-units (small local offices) where one purchase naturally dominates.
- **R$ 10.000 minimum supplier spend** (new, iteration 2): excludes trivial cases such as a company holding R$21k of a R$50k agency total (42%), where both numbers are small. The three cutoffs combine as in the sketch below.
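A minimal sketch of the concentration check (column names follow those cited in this audit; `cnpj_fornecedor` is a placeholder, and the production queries live in `index.ts` / `scan-*.ts`):

```sql
-- Share of one supplier in one ministry's total spend, with the 40% / R$50k / R$10k cutoffs.
WITH spend AS (
  SELECT id_orgao_superior, nome_orgao_superior, cnpj_fornecedor,
         SUM(valor_inicial_compra) AS supplier_spend
  FROM contrato_compra
  GROUP BY id_orgao_superior, nome_orgao_superior, cnpj_fornecedor
),
ministry_total AS (
  SELECT id_orgao_superior, nome_orgao_superior, SUM(supplier_spend) AS agency_total
  FROM spend
  GROUP BY id_orgao_superior, nome_orgao_superior
)
SELECT s.cnpj_fornecedor, s.nome_orgao_superior,
       s.supplier_spend, t.agency_total,
       s.supplier_spend / t.agency_total AS share
FROM spend s
JOIN ministry_total t USING (id_orgao_superior, nome_orgao_superior)
WHERE t.agency_total >= 50000                     -- minimum agency total
  AND s.supplier_spend >= 10000                   -- minimum supplier spend
  AND s.supplier_spend / t.agency_total > 0.40;   -- concentration share
```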
### False positive scenarios
1. **Specialized niches**: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
2. **Monopolistic markets**: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
3. **Framework agreements**: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.
### Improvements applied
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000` to batch query and `scan-suspicious.ts` (iteration 2).
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND` filter to `index.ts` `patternConcentration` HAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): `index.ts` HAVING clause now includes `supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND`.
✅ Fixed (iteration 9): `scan-all.ts` and `scan-suspicious.ts` now group by `(id_orgao_superior, nome_orgao_superior)` in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.
---
## US3 — Inexigibility Recurrence (`inexigibility_recurrence`)
### Legal basis
**Inexigibilidade de licitação** (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
- Reference: **TCU Acórdão 1.793/2011**: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.
### Threshold: 3 contracts per managing unit
- Below 3: could be two legitimate sole-source needs in the same year.
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.
### False positive scenarios
1. **Legitimate exclusive suppliers**: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
2. **Long-term technical partnerships**: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
3. **Artistic/cultural organizations**: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.
### Improvements applied (iteration 2)
- **Batch + scan-suspicious**: Now groups by `id_unidade_gestora` (ID) + `nome_unidade_gestora` (name). Previously grouped by name only, risking merger of distinct units sharing a common name.
- **Batch + scan-suspicious**: Added `valor_inicial_compra >= R$ 1.000` filter. Micro-value contracts (< R$1k) rarely represent real abuse.
### Improvements applied (iteration 4)
- **`index.ts`**: Added `AND valor_inicial_compra >= @min_value` to WHERE clause of `patternInexigibility`. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): all three implementations now filter `valor_inicial_compra >= R$ 1.000` and group by `id_unidade_gestora`.
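A minimal sketch of the recurrence check with both guards applied (the supplier column and the exact modality label are assumptions; implementations may differ in detail):

```sql
-- Inexigibilidade contracts per supplier, managing unit and year, with the R$1.000 floor.
SELECT
  cnpj_fornecedor,                                  -- placeholder column name
  id_unidade_gestora,
  nome_unidade_gestora,
  EXTRACT(YEAR FROM data_assinatura_contrato) AS ano,
  COUNT(*) AS n_contratos,
  SUM(valor_inicial_compra) AS total_value
FROM contrato_compra
WHERE LOWER(modalidade_licitacao) LIKE '%inexigibilidade%'   -- exact label is an assumption
  AND valor_inicial_compra >= 1000
  AND data_assinatura_contrato IS NOT NULL
GROUP BY cnpj_fornecedor, id_unidade_gestora, nome_unidade_gestora, ano
HAVING COUNT(*) >= 3;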
---
## US4 — Single Bidder (`single_bidder`)
### Legal basis
Not inherently illegal, but flagged by:
- **Open Contracting Partnership "73 Red Flags" (2024)**, Flag #1: "Only one bid received."
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.
### Threshold: 2 occurrences
- Intentionally low. Even one solo-bid win warrants investigation context. Two is the minimum pattern.
### False positive scenarios
1. **Specialized markets**: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
2. **Geographic isolation**: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
3. **Poorly timed notices**: Short bid windows or holiday periods reduce participation regardless of market structure.
### SQL robustness notes
- Per-CNPJ: uses `STARTS_WITH(REGEXP_REPLACE(...), @cnpj)` — this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface.
- Batch: uses `MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))` to extract the winner's CNPJ from the `auction_stats` CTE. The `LENGTH = 14` guard in the `IF` condition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows have `vencedor=true` for the same auction (data quality issue), `MAX` picks lexicographically last — acceptable for batch purposes.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): **batch now counts ALL participants** (CPF + CNPJ) for `total_bidders`, matching per-CNPJ behavior. Previously, `LENGTH = 14` excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The `LENGTH = 14` guard is now applied only inside the `winner_cnpj` extraction `IF()` condition — not to the overall participant count.
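A sketch of the `auction_stats` shape described above, with the `LENGTH = 14` guard applied only inside the winner extraction (the `licitacao_participante` column names `id_licitacao` and `documento` are assumptions):

```sql
WITH auction_stats AS (
  SELECT
    id_licitacao,
    COUNT(1) AS total_bidders,                 -- counts CPF and CNPJ participants alike
    MAX(IF(vencedor AND LENGTH(REGEXP_REPLACE(documento, r'\D', '')) = 14,
           SUBSTR(REGEXP_REPLACE(documento, r'\D', ''), 1, 8),
           NULL)) AS winner_cnpj               -- base-8 key, CNPJ winners only
  FROM licitacao_participante
  GROUP BY id_licitacao
)
SELECT winner_cnpj, COUNT(*) AS single_bidder_wins
FROM auction_stats
WHERE total_bidders = 1
  AND winner_cnpj IS NOT NULL
GROUP BY winner_cnpj
HAVING COUNT(*) >= 2;                          -- threshold: 2 occurrences
```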
---
## US5 — Always Winner (`always_winner`)
### Legal basis
Not illegal per se, but high win rates in competitive auctions indicate possible:
- Bid rigging (Lei 12.529/2011 art. 36, IV)
- Tailored specifications (Lei 14.133/2021 art. 9, I)
- Reference: **OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)**
### Thresholds
- **≥80% win rate** (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
- **≥10 competitive participations** — minimum sample for statistical significance. Aligns batch and per-CNPJ.
- **Competitive auctions only (≥2 bidders)** — critical to avoid overlap with US4.
### Critical fix applied (iteration 2)
**The per-CNPJ version was NOT filtering for competitive auctions before this iteration.** A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a `competitive_auctions` CTE that filters `COUNT(1) >= 2`.
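A sketch of the per-CNPJ variant with the `competitive_auctions` guard (column names `id_licitacao` and `documento` are assumptions; `@cnpj` follows the parameter style used elsewhere in this audit):

```sql
WITH competitive_auctions AS (
  SELECT id_licitacao
  FROM licitacao_participante
  GROUP BY id_licitacao
  HAVING COUNT(1) >= 2                -- only auctions with real competition
)
SELECT participations, wins, SAFE_DIVIDE(wins, participations) AS win_rate
FROM (
  SELECT COUNT(*) AS participations, COUNTIF(vencedor) AS wins
  FROM licitacao_participante p
  JOIN competitive_auctions USING (id_licitacao)
  WHERE STARTS_WITH(REGEXP_REPLACE(p.documento, r'\D', ''), @cnpj)
)
WHERE participations >= 10            -- minimum sample
  AND SAFE_DIVIDE(wins, participations) >= 0.80;
```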
### Win rate distribution note
The `licitacao_participante` dataset is **strongly bimodal**: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only **perfect-win companies** — intentionally strict. This is documented in the spec.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.
---
## US6 — Amendment Inflation (`amendment_inflation`)
### Legal basis
**Lei 14.133/2021 art. 125 §1º**: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract **reached or exceeded its legal ceiling**.
### Threshold: 1.25× (25% above original)
- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).
### False positive scenarios
1. **Lawful exceptional amendments**: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
2. **Construction contracts**: Legal ceiling is 50% (not 25%). Our threshold of 1.25× flags construction contracts that are within the legal limit.
3. **Value adjustment clauses**: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
4. **Data entry errors**: Some `valor_final_compra` values are clearly data quality issues (e.g., 100× original).
### Improvements applied (iteration 3)
- **Cap `inflation_ratio` at 10×** (`AMENDMENT_MAX_INFLATION_RATIO = 10.0`): ratios above this threshold are almost certainly data entry errors (e.g., `valor_final_compra` entered in a different unit) and would distort `total_excess` reporting. Applied to all three implementations via `AND ... <= @max_ratio` filter in SQL. Applied in `index.ts`, `scan-all.ts`, `scan-suspicious.ts`.
### Schema verification: construction vs goods/services threshold
Lei 14.133/2021 art.125 §1º allows 50% amendments for engineering works vs 25% for goods/services.
**Column verified (schema dump):** `contrato_compra` has `id_modalidade_licitacao` (code) and `modalidade_licitacao` (name). However, these columns encode **bidding modality** (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no `tipo_contrato` or `categoria` column in the accessible schema.
### Improvements applied (iteration 8): construction keyword detection
All three implementations now apply `IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25)` to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.
**Keywords and rationale:**
| Keyword | Matches | Rationale |
|---------|---------|-----------|
| `obra` | obra, obras | General construction work |
| `constru` | construção, construir | Building/construction |
| `reform` | reforma, reformar, reformas | Renovation/remodeling |
| `engenhari` | engenharia, engenheiro | Engineering services |
| `paviment` | pavimentação, pavimento | Road/floor paving |
| `demoli` | demolição, demolir | Demolition |
**Known limitations:** The `objeto` field is free-text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (conservative false positive vs missed construction exemption).
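Combining the keyword-based ceiling with the 10× data-quality cap, a sketch of the flagging filter (the original-value column name `valor_inicial_compra` is an assumption; the rest follows the prose):

```sql
-- Sketch; valor_inicial_compra is an assumed column name for the original contract value.
SELECT
  cpf_cnpj_contratado,
  objeto,
  SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) AS inflation_ratio,
  REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                  r'obra|constru|reform|engenhari|paviment|demoli') AS is_construction
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE valor_inicial_compra > 0
  AND SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) <= 10.0   -- AMENDMENT_MAX_INFLATION_RATIO
  AND SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) >=
      IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                         r'obra|constru|reform|engenhari|paviment|demoli'),
         1.50, 1.25)                                                  -- per-contract legal ceiling
```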
### Improvements applied (iteration 9): constructionCount field
`AmendmentInflationFlag` now includes `constructionCount`: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."
### Per-CNPJ vs batch consistency
⚠️ Minor divergence (accepted): `index.ts` includes the aditivos CTE (`zeroAmendmentCount`) and `constructionCount` from `is_construction`. The batch scanners do NOT include these — `contrato_termo_aditivo` full scan is too expensive in batch, and `constructionCount` is per-row info not aggregable without the row-level data. Both fields are only available in the web UI's per-CNPJ output.
---
## US7 — Newborn Company (`newborn_company`)
### Legal basis
No specific prohibition, but:
- **Lei 14.133/2021 art. 68, I**: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of possible shell company.
### Thresholds
- **180 days** (6 months): practical minimum for legitimate operational readiness.
- **R$ 50.000 minimum contract value**: excludes training contracts and small acquisitions where new companies are common and low-risk.
### False positive scenarios
1. **Spinoffs and restructurings**: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
2. **Holding company structures**: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
3. **Startups in innovation programs**: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
4. **`data_inicio_atividade` from establishments**: The founding date comes from `br_me_cnpj.estabelecimentos`, not `empresas`. Branches opened after the headquarters can make an established company appear "newborn" in a specific municipality.
### Data quality note
`data_inicio_atividade` lives in `br_me_cnpj.estabelecimentos`, NOT `empresas`. The query uses `MIN(est.data_inicio_atividade)` across all establishments for the same `cnpj_basico` — this correctly picks the earliest known opening date, reducing false positives from branches opened after the original establishment.
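A minimal sketch of that founding-date lookup (partition filters as described above; the contract-side join and `data_assinatura` are illustrative assumptions):

```sql
-- Sketch; uses the 2023-12 partition of br_me_cnpj.estabelecimentos as noted above.
WITH founding AS (
  SELECT
    est.cnpj_basico,
    MIN(est.data_inicio_atividade) AS founded_on   -- earliest opening date across all branches
  FROM `basedosdados.br_me_cnpj.estabelecimentos` est
  WHERE est.ano = 2023 AND est.mes = 12
  GROUP BY est.cnpj_basico
)
SELECT f.cnpj_basico, f.founded_on
FROM founding f
-- A contract is flagged when it is signed within 180 days of founded_on
-- (e.g. DATE_DIFF(data_assinatura, f.founded_on, DAY) <= 180) and
-- valor_final_compra >= 50000.
```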
### Per-CNPJ vs batch consistency
✅ Equivalent. Both use `MIN(data_inicio_atividade)` across establishments with `ano=2023 AND mes=12`.
⚠️ **Known necessary full-table scan**: The `first_contract` CTE in `batchNewborn` (`scan-all.ts`) intentionally omits an `ano` filter on `contrato_compra`:
```sql
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
```
This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: *"did this company win its very first contract within 180 days of founding?"* Restricting to `ano = ANO` would miss the true first contract if it occurred in an earlier year — producing a false negative. The `founding` CTE correctly filters `e.ano = ANO AND est.ano = ANO AND est.mes = 12`. Only `first_contract` scans all years, but the `LENGTH = 14` CPF exclusion and `valor_final_compra >= R$ 50k` filter significantly reduce bytes scanned.
---
## US8 — Sudden Surge (`sudden_surge`)
### Legal basis
Not illegal, but flagged by:
- **UNODC "Guidebook on anti-corruption in public procurement" (2013)**: "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.
### Thresholds
- **5× YoY growth**: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- **R$ 1.000.000 minimum**: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- **4-year lookback**: captures context before the surge.
### False positive scenarios
1. **Post-restructuring recovery**: A company that was inactive for 2 years then resumed full operations would appear to surge.
2. **New framework agreements**: Being added to a large framework agreement in year N can produce apparent surge with no underlying change in the company.
3. **Government budget cycles**: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements) creating apparent surges.
### SQL robustness note
Both per-CNPJ and batch use `prev_v > 0` guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses `LAG` window function; per-CNPJ iterates over the history array client-side.
**Consecutive-year guard (iteration 6):** The spec says `value[year_N] / value[year_N-1]`. Without a guard, `LAG` compares any adjacent rows in sorted order — if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:
- `scan-all.ts`: added `LAG(ano)` alongside `LAG(v)` and `WHERE ano - prev_ano = 1`
- `index.ts`, `scan-suspicious.ts`: added `curr.ano - prev.ano === 1` to the JS loop condition
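A sketch of the batch-side guard (the `yearly` CTE stands in for the per-CNPJ, per-year aggregation; deriving `cnpj_basico` from `cpf_cnpj_contratado` is assumed, not the literal `scan-all.ts` query):

```sql
-- Sketch; thresholds follow the values documented above.
WITH yearly AS (
  SELECT
    SUBSTR(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', ''), 1, 8) AS cnpj_basico,
    ano,
    SUM(valor_final_compra) AS v
  FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
  WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
  GROUP BY cnpj_basico, ano
),
with_prev AS (
  SELECT
    cnpj_basico, ano, v,
    LAG(v)   OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_v,
    LAG(ano) OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_ano
  FROM yearly
)
SELECT cnpj_basico, ano, v, prev_v
FROM with_prev
WHERE prev_v > 0                   -- zero→nonzero handled by US7 instead
  AND ano - prev_ano = 1           -- consecutive-year guard
  AND v >= 1000000                 -- R$ 1M minimum
  AND SAFE_DIVIDE(v, prev_v) >= 5  -- 5× YoY growth
```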
**False-positive note (from the audit):** The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger, since the consecutive-year guard skips comparisons for companies that were dormant for ≥1 year.
The per-CNPJ implementation reports only the **first** qualifying surge year (breaks on first hit). If a company surged twice, only the earlier event is shown. This is conservative.
### Per-CNPJ vs batch consistency
✅ Equivalent. Batch uses SQL `LAG`; per-CNPJ uses JS loop. Both find the first qualifying year.
---
## Infrastructure Issues: Caching Bugs
### Bug 1: Cache Miss vs Stored Null (fixed iteration 6)
`cache.ts` `getCache` was returning `null` for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use `null` as their "nothing found" sentinel and check `cached !== undefined` to skip re-querying. With the old `getCache` returning `null` on miss, `null !== undefined` evaluated to `true`, causing the BigQuery query to be skipped permanently — US4–US8 would never execute on a CNPJ not yet in cache.
**Fix:** `getCache` now returns `undefined` on miss or expiry; returns `T` (including `null`) on a valid cache hit. The company-lookup caller that used `!== null` was updated to `!== undefined`.
### Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)
US1, US2, US3, and `runPatterns()` in `index.ts` used `if (cached) return cached` to check for cache hits. A plain truthiness check treats every falsy cached value as a miss and cannot reliably distinguish "cached an empty result" from "not cached at all", so a cached "no flags found" result (a real cache hit) was silently discarded, causing BigQuery to be re-queried on every subsequent call for clean CNPJs.
Affected: `patternSplitContracts`, `patternConcentration`, `patternInexigibility`, `runPatterns`.
**Fix:** changed all four to `if (cached !== undefined) return cached`. (US4–US8 already used this pattern since they cache `null` as "nothing found" — they were correct.)
---
## Cross-Pattern Issues
### Overlap between US4 and US5
- **Before iteration 2**: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- **After iteration 2**: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.
### Overlap between US7 and US8
- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive — both signals reinforce each other.
### CNPJ matching strategy
All patterns use `cnpj_basico` (8-digit root) as the joining key. This means **all branches (filiais)** of a legal entity are attributed to the same `cnpj_basico`. This can create false positives for large organizations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.
---
## Summary Table
| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---------|---------|------------|---------------|
| US1 Split | Medium — multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; **batch GROUP BY now includes id_orgao_superior** |
| US2 Concentration | Medium — specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; **falsy cache check fixed**; **all 3 now GROUP BY (id+name) — no ministry-name collision** |
| US3 Inexigibility | High — legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; **falsy cache check fixed** |
| US4 Single Bidder | Medium — specialized/remote markets | OCP 2024 Flag #1 | **cache.ts bug fixed** (getCache null-vs-undefined); **batch now counts all participants (CPF+CNPJ)** — consistent with per-CNPJ |
| US5 Always Winner | **Was HIGH** (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; **cache.ts bug fixed** |
| US6 Amendment | Medium — inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; **cache.ts bug fixed**; **construction keyword detection: 1.50× threshold for obras/etc.**; **constructionCount in UI flag** |
| US7 Newborn | High — spinoffs, restructurings | CGU 2021 guide | **cache.ts bug fixed** (was never querying BigQuery on cache miss) |
| US8 Surge | Medium — framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; **cache.ts bug fixed** |

(binary image added, 1.8 MiB, not shown)


@@ -0,0 +1,45 @@
#!/usr/bin/env python3
import json
import re
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
STOPWORDS = {'de', 'do', 'da', 'a', 'ou', 'em', 'e', 'o', 'que', 'das', 'dos', 'nos', 'nas', 'um', 'uma', 'para', 'com', 'não', 'uma', 'à', 'ao', 'os', 'as', 'se', 'na', 'no', 'de', 'do', 'da', 'é', 'ser', 'seu', 'sua', 'isso', 'the', 'of', 'and', 'in', 'to', 'is', 'for', 'on', 'with', 'at', 'by', 'from'}
with open('context/basedosdados-schema.json') as f:
schema = json.load(f)
words = []
for dataset, tables in schema.items():
for table, cols in tables.items():
for col in cols:
name = col.get('name', '').lower()
desc = col.get('description', '').lower()
if name and len(name) >= 3:
words.append(name)
if desc:
for w in desc.split():
w = re.sub(r'[^a-záàâãéèêíìîóòôõúùûç]', '', w)
if len(w) >= 3 and w not in STOPWORDS:
words.append(w)
word_freq = Counter(words)
wc = WordCloud(
width=1600,
height=800,
background_color='white',
max_words=200,
colormap='viridis',
min_font_size=8
).generate_from_frequencies(word_freq)
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_attributes.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_attributes.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))

docs/wordcloud_datasets.png Normal file (binary image, 1.3 MiB, not shown)


@@ -0,0 +1,33 @@
#!/usr/bin/env python3
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
with open('context/basedosdados-schema.json') as f:
schema = json.load(f)
dataset_names = []
for dataset in schema.keys():
parts = dataset.replace('br_', '').replace('mundo_', '').replace('eu_', '').split('_')
dataset_names.extend([p for p in parts if len(p) >= 3])
word_freq = Counter(dataset_names)
wc = WordCloud(
width=1600,
height=800,
background_color='white',
max_words=100,
colormap='plasma',
min_font_size=10
).generate_from_frequencies(word_freq)
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_datasets.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_datasets.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))


@@ -1,268 +0,0 @@
import os
import json
import sys
import pyarrow.parquet as pq
import s3fs
import boto3
import duckdb
from dotenv import load_dotenv
load_dotenv()
S3_ENDPOINT = os.environ["HETZNER_S3_ENDPOINT"]
S3_BUCKET = os.environ["HETZNER_S3_BUCKET"]
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_host = S3_ENDPOINT.removeprefix("https://").removeprefix("http://")
# --- boto3 client (listing only, zero egress) ---
boto = boto3.client(
"s3",
endpoint_url=S3_ENDPOINT,
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY,
)
# --- s3fs filesystem (footer-only reads via pyarrow) ---
fs = s3fs.S3FileSystem(
client_kwargs={"endpoint_url": S3_ENDPOINT},
key=ACCESS_KEY,
secret=SECRET_KEY,
)
# ------------------------------------------------------------------ #
# Phase 1: File inventory via S3 List API (zero data egress)
# ------------------------------------------------------------------ #
print("Phase 1: listing S3 objects...")
paginator = boto.get_paginator("list_objects_v2")
inventory = {} # "dataset/table" -> {files: [...], total_size: int}
for page in paginator.paginate(Bucket=S3_BUCKET):
for obj in page.get("Contents", []):
key = obj["Key"]
if not key.endswith(".parquet"):
continue
parts = key.split("/")
if len(parts) < 3:
continue
dataset, table = parts[0], parts[1]
dt = f"{dataset}/{table}"
if dt not in inventory:
inventory[dt] = {"files": [], "total_size_bytes": 0}
inventory[dt]["files"].append(key)
inventory[dt]["total_size_bytes"] += obj["Size"]
print(f" Found {len(inventory)} tables across {S3_BUCKET}")
# ------------------------------------------------------------------ #
# Phase 2: Schema reads — footer only (~30 KB per table)
# ------------------------------------------------------------------ #
print("Phase 2: reading parquet footers...")
def fmt_size(b):
for unit in ("B", "KB", "MB", "GB", "TB"):
if b < 1024 or unit == "TB":
return f"{b:.1f} {unit}"
b /= 1024
def extract_col_descriptions(schema):
"""Try to pull per-column descriptions from Arrow metadata."""
descriptions = {}
meta = schema.metadata or {}
# BigQuery exports embed a JSON blob under b'pandas' with column_info
pandas_meta_raw = meta.get(b"pandas") or meta.get(b"pandas_metadata")
if pandas_meta_raw:
try:
pm = json.loads(pandas_meta_raw)
for col in pm.get("columns", []):
name = col.get("name")
desc = col.get("metadata", {}) or {}
if isinstance(desc, dict) and "description" in desc:
descriptions[name] = desc["description"]
except Exception:
pass
# Also try top-level b'description' or b'schema'
for key in (b"description", b"schema", b"BigQuery:description"):
val = meta.get(key)
if val:
try:
descriptions["__table__"] = val.decode("utf-8", errors="replace")
except Exception:
pass
return descriptions
schemas = {}
errors = []
for i, (dt, info) in enumerate(sorted(inventory.items())):
dataset, table = dt.split("/", 1)
first_file = info["files"][0]
s3_path = f"{S3_BUCKET}/{first_file}"
try:
schema = pq.read_schema(fs.open(s3_path))
col_descs = extract_col_descriptions(schema)
# Build raw metadata dict (decode bytes keys/values)
raw_meta = {}
if schema.metadata:
for k, v in schema.metadata.items():
try:
dk = k.decode("utf-8", errors="replace")
dv = v.decode("utf-8", errors="replace")
# Try to parse JSON values
try:
dv = json.loads(dv)
except Exception:
pass
raw_meta[dk] = dv
except Exception:
pass
columns = []
for field in schema:
col = {
"name": field.name,
"type": str(field.type),
"nullable": field.nullable,
}
if field.name in col_descs:
col["description"] = col_descs[field.name]
# Check field-level metadata
if field.metadata:
for k, v in field.metadata.items():
try:
dk = k.decode("utf-8", errors="replace")
dv = v.decode("utf-8", errors="replace")
if dk in ("description", "DESCRIPTION", "comment"):
col["description"] = dv
except Exception:
pass
columns.append(col)
schemas[f"{dataset}.{table}"] = {
"path": f"s3://{S3_BUCKET}/{dataset}/{table}/",
"file_count": len(info["files"]),
"total_size_bytes": info["total_size_bytes"],
"total_size_human": fmt_size(info["total_size_bytes"]),
"columns": columns,
"metadata": raw_meta,
}
print(f" [{i+1}/{len(inventory)}] ✓ {dataset}.{table} ({len(columns)} cols, {fmt_size(info['total_size_bytes'])})")
except Exception as e:
errors.append({"table": f"{dataset}.{table}", "error": str(e)})
print(f" [{i+1}/{len(inventory)}] ✗ {dataset}.{table}: {e}", file=sys.stderr)
# ------------------------------------------------------------------ #
# Phase 3: Enrich from br_bd_metadados.bigquery_tables (small table)
# ------------------------------------------------------------------ #
META_TABLE = "br_bd_metadados.bigquery_tables"
meta_dt = "br_bd_metadados/bigquery_tables"
if meta_dt in inventory:
print(f"Phase 3: enriching from {META_TABLE}...")
try:
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute(f"""
SET s3_endpoint='{s3_host}';
SET s3_access_key_id='{ACCESS_KEY}';
SET s3_secret_access_key='{SECRET_KEY}';
SET s3_url_style='path';
""")
meta_path = f"s3://{S3_BUCKET}/br_bd_metadados/bigquery_tables/*.parquet"
# Peek at available columns
available = [r[0] for r in con.execute(f"DESCRIBE SELECT * FROM '{meta_path}' LIMIT 1").fetchall()]
print(f" Metadata columns: {available}")
# Try to find dataset/table description columns
desc_col = next((c for c in available if "description" in c.lower()), None)
ds_col = next((c for c in available if c.lower() in ("dataset_id", "dataset", "schema_name")), None)
tbl_col = next((c for c in available if c.lower() in ("table_id", "table_name", "table")), None)
if desc_col and ds_col and tbl_col:
rows = con.execute(f"""
SELECT {ds_col}, {tbl_col}, {desc_col}
FROM '{meta_path}'
""").fetchall()
for ds, tbl, desc in rows:
key = f"{ds}.{tbl}"
if key in schemas and desc:
schemas[key]["table_description"] = desc
print(f" Enriched {len(rows)} table descriptions")
else:
print(f" Could not find expected columns (dataset_id, table_id, description) — skipping enrichment")
con.close()
except Exception as e:
print(f" Enrichment failed: {e}", file=sys.stderr)
else:
print("Phase 3: br_bd_metadados.bigquery_tables not in S3 — skipping enrichment")
# ------------------------------------------------------------------ #
# Phase 4a: Write schemas.json
# ------------------------------------------------------------------ #
print("Phase 4: writing outputs...")
output = {
"_meta": {
"bucket": S3_BUCKET,
"total_tables": len(schemas),
"total_size_bytes": sum(v["total_size_bytes"] for v in schemas.values()),
"total_size_human": fmt_size(sum(v["total_size_bytes"] for v in schemas.values())),
"errors": errors,
},
"tables": dict(sorted(schemas.items())),
}
with open("schemas.json", "w", encoding="utf-8") as f:
json.dump(output, f, ensure_ascii=False, indent=2)
print(f" ✓ schemas.json ({len(schemas)} tables)")
# ------------------------------------------------------------------ #
# Phase 4b: Write file_tree.md
# ------------------------------------------------------------------ #
lines = [
f"# S3 File Tree: {S3_BUCKET}",
"",
]
# Group by dataset
datasets_map = {}
for dt_key, info in sorted(inventory.items()):
dataset, table = dt_key.split("/", 1)
datasets_map.setdefault(dataset, []).append((table, info))
total_files = sum(len(v["files"]) for v in inventory.values())
total_bytes = sum(v["total_size_bytes"] for v in inventory.values())
for dataset, tables in sorted(datasets_map.items()):
ds_bytes = sum(i["total_size_bytes"] for _, i in tables)
ds_files = sum(len(i["files"]) for _, i in tables)
lines.append(f"## {dataset}/ ({len(tables)} tables, {fmt_size(ds_bytes)}, {ds_files} files)")
lines.append("")
for table, info in sorted(tables):
schema_entry = schemas.get(f"{dataset}.{table}", {})
ncols = len(schema_entry.get("columns", []))
col_str = f", {ncols} cols" if ncols else ""
table_desc = schema_entry.get("table_description", "")
desc_str = f"{table_desc}" if table_desc else ""
lines.append(f" - **{table}/** ({len(info['files'])} files, {fmt_size(info['total_size_bytes'])}{col_str}){desc_str}")
lines.append("")
lines += [
"---",
f"**Total: {len(inventory)} tables · {fmt_size(total_bytes)} · {total_files} parquet files**",
]
with open("file_tree.md", "w", encoding="utf-8") as f:
f.write("\n".join(lines) + "\n")
print(f" ✓ file_tree.md ({len(inventory)} tables)")
print()
print("Done!")
print(f" schemas.json — full column-level schema dump")
print(f" file_tree.md — bucket tree with sizes")
if errors:
print(f" {len(errors)} tables failed (see schemas.json _meta.errors)")


@@ -1,4 +0,0 @@
duckdb
boto3
python-dotenv
openai

scripts/build_ask.sh Executable file (+42 lines)

@@ -0,0 +1,42 @@
#!/bin/bash
set -e
cd "$(dirname "$0")"
echo "=== Building ask binary for Linux x86_64 ==="
echo "Using Debian x86_64 container for native build..."
# Build in an x86_64 Debian container - this gives us a real x86_64 environment
# so we can build natively without cross-compilation complexity
# Use ask/ as context to avoid .dockerignore excluding src/
docker build \
--platform linux/amd64 \
-t ask-builder \
--build-arg BUILDKIT_INLINE_CACHE=1 \
-f - ask/ <<'EOF'
FROM rust:1.85-slim
RUN apt-get update -qq && \
apt-get install -y --no-install-recommends \
build-essential pkg-config libssl-dev && \
apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY . ./
RUN cargo build --release --locked
FROM scratch
COPY --from=0 /build/target/release/ask /ask
EOF
echo "=== Extracting binary ==="
# Extract the binary from the container
docker run --rm --platform linux/amd64 ask-builder cat /ask > ./ask/target/release/ask
# Make it executable
chmod +x ./ask/target/release/ask
echo "=== Binary built successfully ==="
file ./ask/target/release/ask
ls -lh ./ask/target/release/ask


@@ -62,7 +62,8 @@ if $GCLOUD_RUN; then
exit 1
fi
done
-else
+elif ! $SYNC_RUN; then
+# Only require heavy GCP tools for the main export (not for --sync)
for cmd in bq gcloud gsutil parallel rclone flock; do
if ! command -v "$cmd" &>/dev/null; then
log_err "'$cmd' not found. Install google-cloud-sdk, GNU parallel, and rclone."
@@ -164,8 +165,8 @@ echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.clou
| sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list >/dev/null
sudo apt-get update -qq
sudo apt-get install -y google-cloud-cli
chmod +x ~/roda.sh
echo "Dependencies installed."
REMOTE_SETUP
log " Dependencies ready."
@@ -197,6 +198,121 @@ REMOTE_SETUP
exit 0
fi
# =============================================================================
# VM EXPORT — use existing bd-export-vm to export specific tables to GCS → S3
# =============================================================================
if [[ "${1:-}" == "--vm-export" ]]; then
VM_NAME="${GCP_VM_NAME:-bd-export-vm}"
VM_ZONE="${GCP_VM_ZONE:-us-central1-a}"
VM_PROJECT="${GCP_VM_PROJECT:-raspa-491716}"
TABLE_LIST="${2:-missing_tables.txt}"
log "=============================="
log " VM EXPORT MODE"
log " VM: $VM_NAME ($VM_ZONE)"
log " Tables: $TABLE_LIST"
log "=============================="
if [[ ! -f "$TABLE_LIST" ]]; then
log_err "Table list not found: $TABLE_LIST"
exit 1
fi
log "[1/5] Syncing files to VM..."
gcloud compute scp \
"$(dirname "$0")/roda.sh" \
"$(dirname "$0")/.env" \
"$(realpath "$TABLE_LIST")" \
"$VM_NAME:~/" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT"
log "[2/5] Ensuring GCS bucket exists..."
if ! gsutil ls "gs://$BUCKET_NAME" &>/dev/null; then
gsutil mb -p "$VM_PROJECT" -l "$BUCKET_REGION" -b on "gs://$BUCKET_NAME"
log " Bucket created: gs://$BUCKET_NAME"
else
log " Bucket already exists."
fi
log "[3/5] Running export on VM (bq extract + rclone)..."
gcloud compute ssh "$VM_NAME" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT" \
--command="bash -s" <<'REMOTE_EXPORT'
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive
cd ~
set -a
# shellcheck source=.env
source .env
set +a
source ~/.bashrc 2>/dev/null || true
export RCLONE_CONFIG_BD_TYPE="google cloud storage"
export RCLONE_CONFIG_BD_BUCKET_POLICY_ONLY="true"
export RCLONE_CONFIG_HZ_TYPE="s3"
export RCLONE_CONFIG_HZ_PROVIDER="Other"
export RCLONE_CONFIG_HZ_ENDPOINT="$HETZNER_S3_ENDPOINT"
export RCLONE_CONFIG_HZ_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID"
export RCLONE_CONFIG_HZ_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"
echo "[BQ EXTRACT] Starting export of missing tables..."
extract_table() {
local table="$1"
local dataset table_id gcs_prefix
dataset=$(echo "$table" | cut -d. -f1)
table_id=$(echo "$table" | cut -d. -f2)
gcs_prefix="gs://$BUCKET_NAME/$dataset/$table_id"
echo "[EXTRACT] $table"
bq extract \
--project_id="$YOUR_PROJECT" \
--destination_format=PARQUET \
--compression=ZSTD \
--location=US \
"${SOURCE_PROJECT}:${dataset}.${table_id}" \
"${gcs_prefix}/*.parquet" 2>&1 \
|| echo "[FAIL] $table"
}
export -f extract_table
export BUCKET_NAME SOURCE_PROJECT
cat missing_tables.txt | parallel -j8 --bar extract_table {}
echo "[TRANSFER] GCS → Hetzner S3..."
datasets=$(gsutil ls "gs://$BUCKET_NAME/" 2>/dev/null | sed 's|gs://[^/]*/||;s|/$||' | grep -v '^$' | sort -u)
for ds in $datasets; do
echo "[TRANSFER] $ds"
rclone copy "bd:$BUCKET_NAME/$ds/" "hz:$HETZNER_S3_BUCKET/$ds/" \
--transfers 32 --s3-upload-concurrency 32 --progress 2>&1 \
|| echo "[FAIL_TRANSFER] $ds"
done
echo "[DONE] Export complete."
REMOTE_EXPORT
log "[4/5] Verifying transfer..."
S3_COUNT=$(gcloud compute ssh "$VM_NAME" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT" \
--command="source .env && rclone ls hz:\$HETZNER_S3_BUCKET 2>/dev/null | grep -c '\.parquet\$' || echo 0" 2>/dev/null)
log " S3 parquet files: $S3_COUNT"
log "[5/5] Cleaning up GCS bucket..."
read -rp "Delete GCS bucket gs://$BUCKET_NAME? [y/N] " confirm
if [[ "$confirm" =~ ^[Yy]$ ]]; then
gsutil -m rm -r "gs://$BUCKET_NAME"
gsutil rb "gs://$BUCKET_NAME"
log " Bucket deleted."
fi
log "VM export complete."
exit 0
fi
# =============================================================================
# SYNC — BigQuery → S3 direct (no GCS intermediary)
# =============================================================================


@@ -19,13 +19,13 @@ SQL
chmod 600 /app/ssh_init.sql
echo "[start] Starting ttyd terminal (db)..."
-ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/basedosdados.duckdb &
+ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/data/basedosdados.duckdb &
echo "[start] Starting ttyd terminal (ask)..."
-ttyd --port 7682 --writable python3 /app/ask.py &
+ttyd --port 7682 --writable /app/ask &
echo "[start] Starting auth service..."
-python3 /app/auth.py &
+python3 /app/shell/auth.py &
echo "[start] Starting Caddy..."
exec caddy run --config /app/Caddyfile --adapter caddyfile


@@ -1,543 +0,0 @@
#!/usr/bin/env python3
"""
sync_bq_to_local.py
Syncs missing tables from BigQuery (basedosdados project) to Hetzner S3,
then registers them as DuckDB views.
Usage:
python3 sync_bq_to_local.py # full sync
python3 sync_bq_to_local.py --dry-run # list missing tables only
python3 sync_bq_to_local.py --resume # resume from last run
Prerequisites:
gcloud auth application-default login
GCP project with billing enabled (free tier: 1 TB/month)
Environment (.env):
GCP_PROJECT - GCP project ID for billing
HETZNER_S3_BUCKET - S3 bucket name
HETZNER_S3_ENDPOINT - S3 endpoint URL
AWS_ACCESS_KEY_ID - S3 access key
AWS_SECRET_ACCESS_KEY - S3 secret key
"""
import os
import sys
import json
import argparse
import logging
import subprocess
from datetime import datetime
from pathlib import Path
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
import boto3
from botocore.config import Config as BotoConfig
from google.cloud import bigquery
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
LOG_FILE = f"sync_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[
logging.FileHandler(LOG_FILE),
logging.StreamHandler(sys.stdout),
],
)
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
SOURCE_PROJECT = "basedosdados"
MISSING_TABLES_FILE = "tasks/datasets_to_scrap.md"
DONE_FILE = "done_sync.txt"
FAILED_FILE = "failed_sync.txt"
DATA_DIR = "data"
PARQUET_DIR = "parquet"
MAX_RETRIES = 3
BATCH_SIZE = 1 # export one table at a time to manage memory
WORKERS = 4 # parallel uploads
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def load_env():
"""Load required environment variables."""
from dotenv import load_dotenv
load_dotenv()
required = [
"GCP_PROJECT",
"HETZNER_S3_BUCKET",
"HETZNER_S3_ENDPOINT",
"AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY",
]
missing = [v for v in required if not os.environ.get(v)]
if missing:
log.error("Missing env vars: %s", missing)
sys.exit(1)
return {v: os.environ[v] for v in required}
def get_s3_client(env):
"""Create boto3 S3 client configured for Hetzner."""
return boto3.client(
"s3",
endpoint_url=env["HETZNER_S3_ENDPOINT"],
aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
config=BotoConfig(s3={"addressing_style": "path"}),
)
def get_bq_client():
"""Create BigQuery client using Application Default Credentials."""
try:
os.environ["GOOGLE_CLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
os.environ["GCLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
client = bigquery.Client(project=os.environ.get("GCP_PROJECT", ""))
# Test the connection
list(client.list_datasets(max_results=1))
return client
except Exception as e:
log.error("BigQuery auth failed: %s", e)
log.error("")
log.error("Run these commands to authenticate:")
log.error(" gcloud auth login")
log.error(" gcloud auth application-default login")
log.error(" gcloud config set project %s", os.environ.get("GCP_PROJECT", ""))
log.error("")
log.error("The free tier (1 TB/month) is sufficient — no credit card needed.")
sys.exit(1)
def list_bq_tables(bq_client):
"""List all tables in the basedosdados BigQuery project."""
log.info("Discovering tables in BigQuery project: %s", SOURCE_PROJECT)
tables = {}
try:
datasets = list(bq_client.list_datasets())
log.info("Found %d datasets", len(datasets))
except Exception as e:
log.error("Failed to list datasets: %s", e)
sys.exit(1)
for dataset in datasets:
try:
tables_list = list(
bq_client.list_tables(
f"{SOURCE_PROJECT}.{dataset.dataset_id}",
max_results=10000,
)
)
for t in tables_list:
tables[f"{dataset.dataset_id}.{t.table_id}"] = {
"dataset": dataset.dataset_id,
"table": t.table_id,
"full_id": f"{SOURCE_PROJECT}.{dataset.dataset_id}.{t.table_id}",
"schema": [f.name for f in t.schema] if t.schema else [],
"num_bytes": t.num_bytes,
"num_rows": t.num_rows,
}
except Exception as e:
log.warning("Failed to list tables in dataset %s: %s", dataset.dataset_id, e)
log.info("Total BigQuery tables discovered: %d", len(tables))
return tables
def list_s3_tables(s3_client, bucket):
"""List datasets/tables already exported to S3."""
log.info("Discovering tables already in S3 bucket: %s", bucket)
table_files = defaultdict(lambda: defaultdict(list))
try:
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
for obj in page.get("Contents", []):
key = obj["Key"]
if not key.endswith(".parquet"):
continue
parts = key.split("/")
if len(parts) >= 3:
dataset, table = parts[0], parts[1]
table_files[dataset][table].append(key)
except Exception as e:
log.warning("S3 listing error (may be empty bucket): %s", e)
tables = {}
for dataset, t_dict in table_files.items():
for table, files in t_dict.items():
tables[f"{dataset}.{table}"] = files
log.info("Total S3 tables discovered: %d", len(tables))
return tables
def parse_missing_tables_from_md(filepath):
"""Parse the missing tables from tasks/datasets_to_scrap.md.
Returns a dict mapping 'dataset.table' -> description.
Falls back to None (use all non-S3 tables) if file not found.
"""
if not os.path.exists(filepath):
log.warning("Missing file %s, using all non-S3 tables", filepath)
return None
log.info("Parsing missing tables from %s", filepath)
with open(filepath) as f:
content = f.read()
missing = {}
lines = content.split("\n")
i = 0
def next_nonempty(lines, i):
while i < len(lines) and not lines[i].strip():
i += 1
return i
while i < len(lines):
line = lines[i].strip()
# Find the Basedosdados.org section
if "Basedosdados.org" in line and "Not in basedosdados.duckdb" in line:
log.info("Found Basedosdados.org section at line %d", i + 1)
i += 1
break
i += 1
# Now parse table entries
while i < len(lines):
line = lines[i].strip()
# End of section only on top-level ## headers, not ### subsections
if line.startswith("## "):
break
# Skip separators and empty lines
if not line or line.startswith("---") or "|---" in line:
i += 1
continue
# Find rows with backtick-wrapped dataset names (e.g. | `br_abrinq_oca` | ...)
if "`" in line and "|" in line:
# Split by pipe, strip whitespace and backticks
parts = [p.strip().strip("`").strip() for p in line.split("|")]
# Filter empty parts
parts = [p for p in parts if p]
if len(parts) >= 2:
dataset_raw = parts[0]
# Check if it looks like a dataset name (br_*, eu_*, mundo_*, etc.)
is_dataset = any(
dataset_raw.startswith(prefix)
for prefix in ("br_", "eu_", "mundo_", "nl_", "world_")
)
if is_dataset:
# parts[1] contains the missing table names (comma-separated)
tables_raw = parts[1]
for tbl in tables_raw.split(","):
tbl = tbl.strip()
# Clean up: remove parenthetical notes, trailing text
if "(" in tbl:
tbl = tbl.split("(")[0].strip()
if tbl and not tbl.startswith("-"):
missing[f"{dataset_raw}.{tbl}"] = f"from {filepath}"
i += 1
log.info("Parsed %d missing table references from MD", len(missing))
return missing if missing else None
def compute_missing_tables(bq_tables, s3_tables, md_missing):
"""Compute which tables need to be synced."""
if md_missing is None:
log.info("No MD file, computing diff: BQ - S3")
return [
(table_id, info)
for table_id, info in bq_tables.items()
if table_id not in s3_tables
]
log.info("Computing sync targets: MD missing tables not in S3")
targets = []
for key, info in bq_tables.items():
if key in s3_tables:
continue
if key in md_missing:
targets.append((key, info))
else:
# Table not in S3 but not in MD missing list
# Check if its dataset is partially covered
dataset = info["dataset"]
table = info["table"]
# If any table from this dataset is in MD missing, include it
dataset_in_md = any(
k.startswith(f"{dataset}.") and k.split(".", 1)[1] in md_missing
for k in bq_tables
)
if not dataset_in_md:
targets.append((key, info))
return targets
def estimate_size_mb(num_bytes):
"""Estimate size in MB."""
if num_bytes is None:
return "?"
return f"{num_bytes / 1_048_576:.1f}"
# ---------------------------------------------------------------------------
# Export logic
# ---------------------------------------------------------------------------
def sync_table(args, table_id, info, dry_run=False):
"""Sync a single table: BQ → parquet → S3 → DuckDB view."""
bq_client, s3_client, bucket = args
dataset = info["dataset"]
table = info["table"]
full_id = info["full_id"]
s3_key_prefix = f"{dataset}/{table}"
if dry_run:
size_mb = estimate_size_mb(info.get("num_bytes"))
return True, f"[DRY] {dataset}.{table} (~{size_mb} MB)"
# Step 1: Query from BigQuery
log.info("Querying %s from BigQuery", full_id)
query = f"SELECT * FROM `{full_id}`"
try:
query_job = bq_client.query(query, location="US")
df = query_job.to_dataframe()
except Exception as e:
return False, f"BQ query failed for {table_id}: {e}"
if df.empty:
return True, f"[SKIP] {table_id} — empty table"
if df.shape[0] > 10_000_000:
log.warning("Table %s has %d rows — may be slow/memory-intensive", table_id, df.shape[0])
# Step 2: Write to parquet in memory, then upload
import io
import pyarrow as pa
import pyarrow.parquet as pq
buffer = io.BytesIO()
table_pa = pa.Table.from_pandas(df)
# Write with zstd compression
writer = pq.ParquetWriter(
buffer,
table_pa.schema,
compression="zstd",
use_dictionary=True,
)
writer.write_table(table_pa)
writer.close()
buffer.seek(0)
s3_key = f"{s3_key_prefix}/{table}.parquet"
log.info("Uploading %s → s3://%s/%s (%s, %d rows)",
table_id, bucket, s3_key,
f"{buffer.getbuffer().nbytes / 1_048_576:.1f} MB",
df.shape[0])
try:
s3_client.upload_fileobj(
buffer,
bucket,
s3_key,
ExtraArgs={"ContentType": "application/octet-stream"},
)
except Exception as e:
return False, f"S3 upload failed for {table_id}: {e}"
log.info("[DONE] %s uploaded to s3://%s/%s", table_id, bucket, s3_key)
return True, f"[DONE] {table_id}"
def update_duckdb_view(env, table_id, info):
"""Register a new table as a DuckDB view over S3 parquet."""
import duckdb
dataset = info["dataset"]
table = info["table"]
bucket = env["HETZNER_S3_BUCKET"]
endpoint = env["HETZNER_S3_ENDPOINT"].removeprefix("https://").removeprefix("http://")
access_key = env["AWS_ACCESS_KEY_ID"]
secret_key = env["AWS_SECRET_ACCESS_KEY"]
# S3 path
s3_path = f"s3://{bucket}/{dataset}/{table}/{table}.parquet"
try:
con = duckdb.connect("basedosdados.duckdb", read_only=False)
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute(f"SET s3_endpoint='{endpoint}';")
con.execute(f"SET s3_access_key_id='{access_key}';")
con.execute(f"SET s3_secret_access_key='{secret_key}';")
con.execute(f"SET s3_url_style='path';")
con.execute(f"CREATE SCHEMA IF NOT EXISTS {dataset}")
con.execute(f"""
CREATE OR REPLACE VIEW {dataset}.{table} AS
SELECT * FROM read_parquet('{s3_path}', hive_partitioning=true, union_by_name=true)
""")
con.close()
log.info("[DUCKDB] View created: %s.%s", dataset, table)
return True, None
except Exception as e:
log.error("[DUCKDB] Failed to create view %s.%s: %s", dataset, table, e)
return False, str(e)
def run_sync(targets, args, env, dry_run=False, resume=False):
"""Run the sync for all target tables."""
s3_client = get_s3_client(env)
bq_client = get_bq_client()
# Load done/failed tracking
done_set = set()
if resume:
if os.path.exists(DONE_FILE):
with open(DONE_FILE) as f:
done_set = {l.strip() for l in f if l.strip()}
log.info("Resuming: %d tables already done", len(done_set))
failed_count = 0
done_count = 0
# Filter out already-done tables
targets = [(tid, info) for tid, info in targets if tid not in done_set]
if not targets:
log.info("No tables to sync.")
return 0, 0
log.info("Syncing %d tables...", len(targets))
for i, (table_id, info) in enumerate(targets, 1):
log.info("--- [%d/%d] Syncing %s ---", i, len(targets), table_id)
# Sync BQ → S3
ok, msg = sync_table(
(bq_client, s3_client, env["HETZNER_S3_BUCKET"]),
table_id,
info,
dry_run=dry_run,
)
log.info(msg)
if dry_run:
continue
if not ok:
with open(FAILED_FILE, "a") as f:
f.write(f"{table_id}\t{msg}\n")
failed_count += 1
continue
if "empty" in msg.lower():
continue
# Update DuckDB view
ok, err = update_duckdb_view(env, table_id, info)
if not ok:
with open(FAILED_FILE, "a") as f:
f.write(f"{table_id}\tDUCKDB: {err}\n")
# Mark done
with open(DONE_FILE, "a") as f:
f.write(f"{table_id}\n")
done_count += 1
return done_count, failed_count
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="Sync missing BQ tables to S3")
parser.add_argument("--dry-run", action="store_true", help="List tables without syncing")
parser.add_argument("--resume", action="store_true", help="Resume from last run")
args = parser.parse_args()
env = load_env()
dry_run = args.dry_run
if dry_run:
log.info("=== DRY RUN MODE ===")
# Step 1: List BigQuery tables
bq_client = get_bq_client()
bq_tables = list_bq_tables(bq_client)
# Step 2: List S3 tables
s3_client = get_s3_client(env)
s3_tables = list_s3_tables(s3_client, env["HETZNER_S3_BUCKET"])
# Step 3: Parse missing tables from MD
md_missing = parse_missing_tables_from_md(MISSING_TABLES_FILE)
# Step 4: Compute targets
targets = compute_missing_tables(bq_tables, s3_tables, md_missing)
if not targets:
log.info("No tables to sync.")
return
log.info("")
log.info("============================================")
log.info(" Tables to sync: %d", len(targets))
log.info("============================================")
for i, (table_id, info) in enumerate(targets, 1):
size_mb = estimate_size_mb(info.get("num_bytes"))
md_note = md_missing.get(table_id, "")
log.info(" [%d] %-50s %6s MB %s", i, table_id, size_mb, md_note)
log.info("")
if dry_run:
total_bytes = sum(info.get("num_bytes", 0) or 0 for _, info in targets)
total_gb = total_bytes / 1_073_741_824
log.info("Total estimated size: %.2f GB (BigQuery compressed bytes)", total_gb)
log.info("Run without --dry-run to start syncing.")
return
# Step 5: Run sync
log.info("Starting sync...")
done_count, failed_count = run_sync(targets, None, env, dry_run=False, resume=args.resume)
log.info("")
log.info("============================================")
log.info(" Sync complete!")
log.info(" Done: %d tables", done_count)
log.info(" Failed: %d tables", failed_count)
log.info(" Log: %s", LOG_FILE)
log.info("============================================")
if failed_count > 0:
log.info("Failed tables: see %s", FAILED_FILE)
sys.exit(1)
if __name__ == "__main__":
main()


@@ -143,11 +143,36 @@ Sources from https://github.com/jxnxts/mcp-brasil not in `basedosdados.duckdb`.
| INPE | `inpe` | none | `https://terrabrasilis.dpi.inpe.br/queimadas/bdqueimadas-data-service` | JSON |
| Tabua Mares | `tabua_mares` | none | `https://tabuademares.com/api/v2` | JSON |
-## Basedosdados.org — Not in basedosdados.duckdb (232 tables)
+## Basedosdados.org — Not in basedosdados.duckdb
-Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and thus in your duckdb). The following datasets have **zero or partial** tables in duckdb.
+Basedosdados.org has **765 tables** on BigQuery, **~533** on S3. The remaining gap:
-### Full datasets — no tables in duckdb
+- **2 TABLEs** need `bq extract` → GCS → S3 (waiting on GCP billing restore)
- **~230 are VIEWs** → need `bq query` to materialize, then `bq extract` (or streaming write to S3)
- **3 tables MISSING** from BQ entirely (br_bcb_sicor microdados_* don't exist)
### Need export — 2 TABLEs blocked on GCP billing
| Dataset | Table | BQ Type | Notes |
|---------|-------|---------|-------|
| `br_bcb_taxa_cambio` | taxa_cambio | TABLE | ✅ `bq extract` works |
| `br_bcb_taxa_selic` | taxa_selic | TABLE | ✅ `bq extract` works |
### Already on S3 (no action needed)
| Dataset | Tables |
|---------|--------|
| `br_bd_metadados` | bigquery_tables, prefect_flow_runs |
| `br_fbsp_absp` | uf, violencia_escola |
| `br_ibge_estadic` | dicionario |
| `br_camara_dados_abertos` | all 33 tables (222 parquet files) |
| `br_me_rais` | dicionario, microdados_estabelecimentos, microdados_vinculos |
### ~230 VIEWs — need bq query materialization pipeline
Cannot `bq extract` directly. Need to: (1) materialize via `bq query --destination_table`, or (2) stream via Python Arrow → S3 directly.
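For option (1), a minimal example of the materialization step, assuming a scratch dataset named `temp_export` in the billing project (project and table names here are placeholders, not the pipeline's actual configuration):

```sql
-- Placeholder names; the billed project must be your own, not `basedosdados`.
CREATE OR REPLACE TABLE `YOUR_PROJECT.temp_export.municipio` AS
SELECT * FROM `basedosdados.br_ana_atlas_esgotos.municipio`;
-- The materialized table can then be exported with `bq extract` as usual.
```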
#### Full datasets (all VIEWs)
| Dataset | Tables missing | Notes |
|---------|----------------|-------|
@@ -157,7 +182,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_anvisa_medicamentos_industrializados` | microdados | |
| `br_ba_feiradesantana_camara_leis` | microdados | |
| `br_bd_diretorios_data_tempo` | tempo, data, ano, mes, dia, hora, bimestre, trimestre, semestre, minuto, segundo | Directory of time dimensions |
-| `br_bd_metadados` | external_links, information_requests, organizations, prefect_flows, resources, tables | BD metadata catalog |
+| `br_bd_metadados` | external_links, information_requests, organizations, resources, tables | |
| `br_bd_vizinhanca` | municipio, uf | |
| `br_caixa_sorteios` | megasena | |
| `br_camara_dados_abertos` | sigla_partido | |
@@ -179,7 +204,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_ieps_saude` | brasil, macrorregiao, municipio, regiao_saude, uf | |
| `br_imprensa_nacional_dou` | secao_1, secao_2, secao_3 | Official gazette sections |
| `br_ipea_acesso_oportunidades` | estatisticas_2019, indicadores_2019 | |
-| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_municipio_de_para_anual/decenal/quinquenal, transicao_uf_de_para_anual/decenal/quinquenal | |
+| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_*(anual/decenal/quinquenal) | |
| `br_mc_indicadores` | transferencias_municipio | |
| `br_me_clima_organizacional` | microdados | |
| `br_me_estoque_divida_publica` | microdados | |
@@ -188,7 +213,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_me_siape` | servidores_executivo_federal | |
| `br_me_siorg` | remuneracao | |
| `br_mma_extincao` | fauna_ameacada, flora_ameacada | |
-| `br_mobilidados_indicadores` | 11 tables (comprometimento_renda_tarifa_transp_publico, proporcao_*, taxa_motorizacao, etc.) | |
+| `br_mobilidados_indicadores` | 11 tables | |
| `br_ms_atencao_basica` | municipio | |
| `br_ms_imunizacoes` | municipio | |
| `br_ons_energia_armazenada` | subsistemas | |
@@ -219,18 +244,16 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `world_ti_corruption_perception` | country | |
| `world_wb_wwbi` | country_finance, country_indicators | |
-### Partial datasets — some tables in duckdb, some missing
+#### Partial datasets — missing tables (all VIEWs, except where noted)
| Dataset | Missing tables | In duckdb |
|---------|----------------|-----------|
| `br_anatel_banda_larga_fixa` | backhaul, pble | densidade_*, microdados |
-| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, recurso_publico_* |
+| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, + 5 more TABLEs |
-| `br_bcb_taxa_cambio` | taxa_cambio | — (ACCESS_DENIED) |
-| `br_bcb_taxa_selic` | taxa_selic | — (ACCESS_DENIED) |
| `br_ibge_pib` | brasil_antigo, municipio_antigo, regiao_antigo, uf, uf_antigo | gini, municipio |
| `br_ibge_pnad_covid` | microdados | dicionario |
-| `br_ibge_pnadc` | ano_brasil_grupo_idade, ano_brasil_raca_cor, ano_municipio_*, ano_regiao_*, ano_uf_* (cross-tabs) | dicionario, educacao, microdados, rendimentos_outras_fontes |
+| `br_ibge_pnadc` | 10 cross-tab tables (ano_*) | dicionario, educacao, microdados, rendimentos_outras_fontes |
-| `br_ibge_pof` | all 17 tables (morador, domicilio, despesa_*, consumo_*, etc.) | none |
+| `br_ibge_pof` | all 17 tables (morador_*, domicilio_*, despesa_*, consumo_*, etc.) | none |
| `br_inep_ana` | aluno, escola, prova | dicionario |
| `br_inep_censo_escolar` | docente, matricula | dicionario, escola, turma |
| `br_inep_formacao_docente` | brasil, escola, municipio, regiao, uf | dicionario |
@@ -238,8 +261,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_inep_indicadores_educacionais` | escola_nivel_socioeconomico, fluxo_educacao_superior | all others |
| `br_inmet_bdmep` | estacao | microdados |
| `br_me_caged` | microdados_antigos, microdados_antigos_ajustes | dicionario, microdados_movimentacao* |
-| `br_me_cno` | microdados, microdados_cnae, microdados_vinculo | dicionario, microdados |
+| `br_me_cno` | microdados_cnae, microdados_vinculo | dicionario, microdados |
-| `br_me_rais` | all tables | dicionario, microdados_estabelecimentos, microdados_vinculos |
| `br_mec_prouni` | microdados | dicionario |
| `br_ms_sim` | municipio, municipio_causa, municipio_causa_idade, municipio_causa_idade_sexo_raca | dicionario, microdados |
| `br_ms_sinan` | microdados_violencia | dicionario, microdados_dengue, microdados_influenza_srag |
@@ -247,3 +269,13 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_seeg_emissoes` | brasil | dicionario, municipio, uf |
| `br_tse_eleicoes` | local_secao | all others |
| `world_oecd_pisa` | dictionary, school_summary, student_summary | student |
### Tables that don't exist in BigQuery (3)
These were listed in datasets_to_scrap but actually don't exist in `basedosdados`:
| Dataset | Table |
|---------|-------|
| `br_bcb_sicor` | microdados_liberacao |
| `br_bcb_sicor` | microdados_operacao |
| `br_bcb_sicor` | microdados_saldo |

tasks/missing_tables.txt Normal file (+270 lines)

@@ -0,0 +1,270 @@
br_abrinq_oca.municipio_primeira_infancia
br_ana_atlas_esgotos.municipio
br_ana_reservatorios.sin
br_anvisa_medicamentos_industrializados.microdados
br_ba_feiradesantana_camara_leis.microdados
br_bd_diretorios_data_tempo.ano
br_bd_diretorios_data_tempo.bimestre
br_bd_diretorios_data_tempo.data
br_bd_diretorios_data_tempo.dia
br_bd_diretorios_data_tempo.hora
br_bd_diretorios_data_tempo.mes
br_bd_diretorios_data_tempo.minuto
br_bd_diretorios_data_tempo.segundo
br_bd_diretorios_data_tempo.semestre
br_bd_diretorios_data_tempo.tempo
br_bd_diretorios_data_tempo.trimestre
br_bd_metadados.bigquery_tables
br_bd_metadados.external_links
br_bd_metadados.information_requests
br_bd_metadados.organizations
br_bd_metadados.prefect_flow_runs
br_bd_metadados.resources
br_bd_metadados.tables
br_bd_vizinhanca.municipio
br_bd_vizinhanca.uf
br_caixa_sorteios.megasena
br_camara_dados_abertos.deputado
br_camara_dados_abertos.deputado_ocupacao
br_camara_dados_abertos.deputado_profissao
br_camara_dados_abertos.despesa
br_camara_dados_abertos.evento
br_camara_dados_abertos.evento_orgao
br_camara_dados_abertos.evento_presenca_deputado
br_camara_dados_abertos.evento_requerimento
br_camara_dados_abertos.frente
br_camara_dados_abertos.frente_deputado
br_camara_dados_abertos.funcionario
br_camara_dados_abertos.legislatura
br_camara_dados_abertos.legislatura_mesa
br_camara_dados_abertos.licitacao
br_camara_dados_abertos.licitacao_contrato
br_camara_dados_abertos.licitacao_item
br_camara_dados_abertos.licitacao_pedido
br_camara_dados_abertos.licitacao_proposta
br_camara_dados_abertos.orgao
br_camara_dados_abertos.orgao_deputado
br_camara_dados_abertos.proposicao_autor
br_camara_dados_abertos.proposicao_microdados
br_camara_dados_abertos.proposicao_tema
br_camara_dados_abertos.sigla_partido
br_camara_dados_abertos.votacao
br_camara_dados_abertos.votacao_objeto
br_camara_dados_abertos.votacao_orientacao_bancada
br_camara_dados_abertos.votacao_parlamentar
br_camara_dados_abertos.votacao_proposicao
br_capes_bolsas.mobilidade_internacional
br_cgu_ebt.municipio
br_cgu_ebt.uf
br_cgu_fef.microdados
br_cgu_fef.municipios_sorteados
br_cgu_fef.sorteio
br_cgu_pessoal_executivo_federal.terceirizados
br_clp_ranking_competitividade.nota_geral_municipio
br_clp_ranking_competitividade.nota_geral_uf
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
br_fbsp_absp.municipio
br_fbsp_absp.uf
br_fbsp_absp.violencia_escola
br_firjan_ifgf.ranking
br_ggb_relatorio_lgbtqi.brasil
br_ggb_relatorio_lgbtqi.causa_obito
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
br_ggb_relatorio_lgbtqi.local
br_ggb_relatorio_lgbtqi.raca_cor
br_ibge_amc.municipio_de_para
br_ibge_cbo_2002.perfil_ocupacional
br_ibge_cbo_2002.sinonimo
br_ibge_estadic.comunicacao_informatica
br_ibge_estadic.dicionario
br_ibge_estadic.educacao
br_ibge_estadic.governanca
br_ibge_estadic.indicadores_perfil_gestor
br_ibge_estadic.indicadores_quantidade_vinculo
br_ibge_estadic.politica_mulher
br_ibge_estadic.recursos_humanos
br_ibge_ipp.mes_categoria_economica
br_ibge_ipp.mes_grupo_industrial
br_ibge_ipp.mes_industria_atividade
br_ibge_ipp.mes_industria_extrativa
br_ibge_ipp.mes_industria_geral
br_ibge_ipp.mes_industria_transformacao
br_ibge_munic.indicadores_perfil_gestor
br_ibge_munic.indicadores_quantidade_vinculo
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
br_ieps_saude.brasil
br_ieps_saude.macrorregiao
br_ieps_saude.municipio
br_ieps_saude.regiao_saude
br_ieps_saude.uf
br_imprensa_nacional_dou.secao_1
br_imprensa_nacional_dou.secao_2
br_imprensa_nacional_dou.secao_3
br_ipea_acesso_oportunidades.estatisticas_2019
br_ipea_acesso_oportunidades.indicadores_2019
br_mapbiomas_estatisticas.classe
br_mapbiomas_estatisticas.cobertura_municipio_classe
br_mapbiomas_estatisticas.cobertura_uf_classe
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
br_mc_indicadores.transferencias_municipio
br_me_clima_organizacional.microdados
br_me_estoque_divida_publica.microdados
br_me_exportadoras_importadoras.dicionario
br_me_exportadoras_importadoras.estabelecimentos
br_me_pensionistas.microdados
br_me_siape.servidores_executivo_federal
br_me_siorg.remuneracao
br_mma_extincao.fauna_ameacada
br_mma_extincao.flora_ameacada
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
br_mobilidados_indicadores.divisao_modal
br_mobilidados_indicadores.emissao_co2_material_particulado
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
br_mobilidados_indicadores.taxa_motorizacao
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
br_mobilidados_indicadores.transporte_media_alta_capacidade
br_ms_atencao_basica.municipio
br_ms_imunizacoes.municipio
br_ons_energia_armazenada.subsistemas
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
br_rj_rio_de_janeiro_ipp_ips.indicadores
br_rj_tce_iegm.indicadores
br_senado_cpipandemia.discursos
br_sgp_informacao.despesas_cartao_corporativo
br_sp_alesp.assessores_lideranca
br_sp_alesp.assessores_parlamentares
br_sp_alesp.deputado
br_sp_alesp.despesas_gabinete
br_sp_alesp.despesas_gabinete_atual
br_sp_gov_orcamento.despesa
br_sp_gov_orcamento.receita_arrecadada
br_sp_gov_orcamento.receita_prevista
br_sp_gov_ssp.ocorrencias_registradas
br_sp_gov_ssp.produtividade_policial
br_sp_saopaulo_dieese_icv.ano
br_sp_seduc_fluxo_escolar.escola
br_sp_seduc_fluxo_escolar.municipio
br_sp_seduc_idesp.diretoria
br_sp_seduc_idesp.escola
br_sp_seduc_idesp.uf
br_sp_seduc_inse.escola
br_tpe_classificacao_saeb.categoria
eu_fra_lgbt.consciencia_direitos
eu_fra_lgbt.cotidiano
eu_fra_lgbt.discriminacao
eu_fra_lgbt.especifico_transgenero
eu_fra_lgbt.violencia_abuso
mundo_bm_learning_poverty.pais
mundo_kaggle_olimpiadas.microdados
mundo_onu_adh.brasil
mundo_onu_adh.municipio
mundo_onu_adh.uf
mundo_transrespect_transphobia.causa_obito
mundo_transrespect_transphobia.local
mundo_transrespect_transphobia.pais
nl_ug_pwt.microdados
world_fao_production.country_group
world_fao_production.crop_livestock
world_fao_production.dictionary
world_fao_production.element
world_fao_production.item
world_fao_production.item_group
world_fao_production.production_indices
world_fao_production.value_agricultural_production
world_fifa_women_world_cup.matches
world_fifa_worldcup.award_winners
world_fifa_worldcup.matches
world_fifa_worldcup.players
world_fifa_worldcup.teams
world_fifa_worldcup.tournaments
world_gsps_consortium_gsps.global_indicators
world_slave_voyages_consortium_slave_trade.transatlantic
world_spi_spi.global_indicators
world_ti_corruption_perception.country
world_wb_wwbi.country_finance
world_wb_wwbi.country_indicators
br_anatel_banda_larga_fixa.backhaul
br_anatel_banda_larga_fixa.pble
br_bcb_sicor.microdados_liberacao
br_bcb_sicor.microdados_operacao
br_bcb_sicor.microdados_saldo
br_bcb_taxa_cambio.taxa_cambio
br_bcb_taxa_selic.taxa_selic
br_ibge_pib.brasil_antigo
br_ibge_pib.municipio_antigo
br_ibge_pib.regiao_antigo
br_ibge_pib.uf
br_ibge_pib.uf_antigo
br_ibge_pnad_covid.microdados
br_ibge_pnadc.ano_brasil_grupo_idade
br_ibge_pnadc.ano_brasil_raca_cor
br_ibge_pnadc.ano_municipio_grupo_idade
br_ibge_pnadc.ano_municipio_raca_cor
br_ibge_pnadc.ano_regiao_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
br_ibge_pnadc.ano_regiao_raca_cor
br_ibge_pnadc.ano_uf_grupo_idade
br_ibge_pnadc.ano_uf_raca_cor
br_ibge_pof.aluguel_estimado_2017
br_ibge_pof.cadastro_de_produtos_2017
br_ibge_pof.caderneta_coletiva_2017
br_ibge_pof.caracteristicas_dieta_2017
br_ibge_pof.condicoes_vida_2017
br_ibge_pof.consumo_alimentar_2017
br_ibge_pof.despesa_coletiva_2017
br_ibge_pof.despesa_individual_2017
br_ibge_pof.domicilio_2017
br_ibge_pof.inventario_2017
br_ibge_pof.morador_2017
br_ibge_pof.outros_rendimentos_2017
br_ibge_pof.rendimento_trabalho_2017
br_ibge_pof.restricao_saude_2017
br_ibge_pof.servico_nao_monetario_pof2_2017
br_ibge_pof.servico_nao_monetario_pof4_2017
br_inep_ana.aluno
br_inep_ana.escola
br_inep_ana.prova
br_inep_censo_escolar.docente
br_inep_censo_escolar.matricula
br_inep_formacao_docente.brasil
br_inep_formacao_docente.escola
br_inep_formacao_docente.municipio
br_inep_formacao_docente.regiao
br_inep_formacao_docente.uf
br_inep_indicador_nivel_socioeconomico.brasil
br_inep_indicador_nivel_socioeconomico.municipio
br_inep_indicador_nivel_socioeconomico.uf
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
br_inep_indicadores_educacionais.fluxo_educacao_superior
br_inmet_bdmep.estacao
br_me_caged.microdados_antigos
br_me_caged.microdados_antigos_ajustes
br_me_cno.microdados_cnae
br_me_cno.microdados_vinculo
br_me_rais.dicionario
br_me_rais.microdados_estabelecimentos
br_me_rais.microdados_vinculos
br_mec_prouni.microdados
br_ms_sim.municipio
br_ms_sim.municipio_causa
br_ms_sim.municipio_causa_idade
br_ms_sim.municipio_causa_idade_sexo_raca
br_ms_sinan.microdados_violencia
br_ms_vacinacao_covid19.microdados
br_ms_vacinacao_covid19.microdados_estabelecimento
br_ms_vacinacao_covid19.microdados_paciente
br_ms_vacinacao_covid19.microdados_vacinacao
br_seeg_emissoes.brasil
br_tse_eleicoes.local_secao
world_oecd_pisa.dictionary
world_oecd_pisa.school_summary
world_oecd_pisa.student_summary

2
tasks/pending_tables.txt Normal file
View File

@@ -0,0 +1,2 @@
br_bcb_taxa_cambio.taxa_cambio
br_bcb_taxa_selic.taxa_selic

View File

@@ -0,0 +1,229 @@
br_abrinq_oca.municipio_primeira_infancia
br_ana_atlas_esgotos.municipio
br_ana_reservatorios.sin
br_anvisa_medicamentos_industrializados.microdados
br_ba_feiradesantana_camara_leis.microdados
br_bd_diretorios_data_tempo.ano
br_bd_diretorios_data_tempo.bimestre
br_bd_diretorios_data_tempo.data
br_bd_diretorios_data_tempo.dia
br_bd_diretorios_data_tempo.hora
br_bd_diretorios_data_tempo.mes
br_bd_diretorios_data_tempo.minuto
br_bd_diretorios_data_tempo.segundo
br_bd_diretorios_data_tempo.semestre
br_bd_diretorios_data_tempo.tempo
br_bd_diretorios_data_tempo.trimestre
br_bd_metadados.external_links
br_bd_metadados.information_requests
br_bd_metadados.organizations
br_bd_metadados.resources
br_bd_metadados.tables
br_bd_vizinhanca.municipio
br_bd_vizinhanca.uf
br_caixa_sorteios.megasena
br_camara_dados_abertos.sigla_partido
br_capes_bolsas.mobilidade_internacional
br_cgu_ebt.municipio
br_cgu_ebt.uf
br_cgu_fef.microdados
br_cgu_fef.municipios_sorteados
br_cgu_fef.sorteio
br_cgu_pessoal_executivo_federal.terceirizados
br_clp_ranking_competitividade.nota_geral_municipio
br_clp_ranking_competitividade.nota_geral_uf
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
br_fbsp_absp.municipio
br_firjan_ifgf.ranking
br_ggb_relatorio_lgbtqi.brasil
br_ggb_relatorio_lgbtqi.causa_obito
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
br_ggb_relatorio_lgbtqi.local
br_ggb_relatorio_lgbtqi.raca_cor
br_ibge_amc.municipio_de_para
br_ibge_cbo_2002.perfil_ocupacional
br_ibge_cbo_2002.sinonimo
br_ibge_estadic.comunicacao_informatica
br_ibge_estadic.educacao
br_ibge_estadic.governanca
br_ibge_estadic.indicadores_perfil_gestor
br_ibge_estadic.indicadores_quantidade_vinculo
br_ibge_estadic.politica_mulher
br_ibge_estadic.recursos_humanos
br_ibge_ipp.mes_categoria_economica
br_ibge_ipp.mes_grupo_industrial
br_ibge_ipp.mes_industria_atividade
br_ibge_ipp.mes_industria_extrativa
br_ibge_ipp.mes_industria_geral
br_ibge_ipp.mes_industria_transformacao
br_ibge_munic.indicadores_perfil_gestor
br_ibge_munic.indicadores_quantidade_vinculo
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
br_ibge_pib.brasil_antigo
br_ibge_pib.municipio_antigo
br_ibge_pib.regiao_antigo
br_ibge_pib.uf
br_ibge_pib.uf_antigo
br_ibge_pnad_covid.microdados
br_ibge_pnadc.ano_brasil_grupo_idade
br_ibge_pnadc.ano_brasil_raca_cor
br_ibge_pnadc.ano_municipio_grupo_idade
br_ibge_pnadc.ano_municipio_raca_cor
br_ibge_pnadc.ano_regiao_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
br_ibge_pnadc.ano_regiao_raca_cor
br_ibge_pnadc.ano_uf_grupo_idade
br_ibge_pnadc.ano_uf_raca_cor
br_ibge_pof.aluguel_estimado_2017
br_ibge_pof.cadastro_de_produtos_2017
br_ibge_pof.caderneta_coletiva_2017
br_ibge_pof.caracteristicas_dieta_2017
br_ibge_pof.condicoes_vida_2017
br_ibge_pof.consumo_alimentar_2017
br_ibge_pof.despesa_coletiva_2017
br_ibge_pof.despesa_individual_2017
br_ibge_pof.domicilio_2017
br_ibge_pof.inventario_2017
br_ibge_pof.morador_2017
br_ibge_pof.outros_rendimentos_2017
br_ibge_pof.rendimento_trabalho_2017
br_ibge_pof.restricao_saude_2017
br_ibge_pof.servico_nao_monetario_pof2_2017
br_ibge_pof.servico_nao_monetario_pof4_2017
br_ieps_saude.brasil
br_ieps_saude.macrorregiao
br_ieps_saude.municipio
br_ieps_saude.regiao_saude
br_ieps_saude.uf
br_imprensa_nacional_dou.secao_1
br_imprensa_nacional_dou.secao_2
br_imprensa_nacional_dou.secao_3
br_ipea_acesso_oportunidades.estatisticas_2019
br_ipea_acesso_oportunidades.indicadores_2019
br_mapbiomas_estatisticas.classe
br_mapbiomas_estatisticas.cobertura_municipio_classe
br_mapbiomas_estatisticas.cobertura_uf_classe
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
br_mc_indicadores.transferencias_municipio
br_me_caged.microdados_antigos
br_me_caged.microdados_antigos_ajustes
br_me_clima_organizacional.microdados
br_me_cno.microdados_cnae
br_me_cno.microdados_vinculo
br_me_estoque_divida_publica.microdados
br_me_exportadoras_importadoras.dicionario
br_me_exportadoras_importadoras.estabelecimentos
br_me_pensionistas.microdados
br_me_siape.servidores_executivo_federal
br_me_siorg.remuneracao
br_mec_prouni.microdados
br_mma_extincao.fauna_ameacada
br_mma_extincao.flora_ameacada
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
br_mobilidados_indicadores.divisao_modal
br_mobilidados_indicadores.emissao_co2_material_particulado
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
br_mobilidados_indicadores.taxa_motorizacao
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
br_mobilidados_indicadores.transporte_media_alta_capacidade
br_ms_atencao_basica.municipio
br_ms_imunizacoes.municipio
br_ms_sim.municipio
br_ms_sim.municipio_causa
br_ms_sim.municipio_causa_idade
br_ms_sim.municipio_causa_idade_sexo_raca
br_ms_sinan.microdados_violencia
br_ms_vacinacao_covid19.microdados
br_ms_vacinacao_covid19.microdados_estabelecimento
br_ms_vacinacao_covid19.microdados_paciente
br_ms_vacinacao_covid19.microdados_vacinacao
br_ons_energia_armazenada.subsistemas
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
br_rj_rio_de_janeiro_ipp_ips.indicadores
br_rj_tce_iegm.indicadores
br_seeg_emissoes.brasil
br_senado_cpipandemia.discursos
br_sgp_informacao.despesas_cartao_corporativo
br_sp_alesp.assessores_lideranca
br_sp_alesp.assessores_parlamentares
br_sp_alesp.deputado
br_sp_alesp.despesas_gabinete
br_sp_alesp.despesas_gabinete_atual
br_sp_gov_orcamento.despesa
br_sp_gov_orcamento.receita_arrecadada
br_sp_gov_orcamento.receita_prevista
br_sp_gov_ssp.ocorrencias_registradas
br_sp_gov_ssp.produtividade_policial
br_sp_saopaulo_dieese_icv.ano
br_sp_seduc_fluxo_escolar.escola
br_sp_seduc_fluxo_escolar.municipio
br_sp_seduc_idesp.diretoria
br_sp_seduc_idesp.escola
br_sp_seduc_idesp.uf
br_sp_seduc_inse.escola
br_tpe_classificacao_saeb.categoria
br_tse_eleicoes.local_secao
eu_fra_lgbt.consciencia_direitos
eu_fra_lgbt.cotidiano
eu_fra_lgbt.discriminacao
eu_fra_lgbt.especifico_transgenero
eu_fra_lgbt.violencia_abuso
mundo_bm_learning_poverty.pais
mundo_kaggle_olimpiadas.microdados
mundo_onu_adh.brasil
mundo_onu_adh.municipio
mundo_onu_adh.uf
mundo_transrespect_transphobia.causa_obito
mundo_transrespect_transphobia.local
mundo_transrespect_transphobia.pais
nl_ug_pwt.microdados
world_fao_production.country_group
world_fao_production.crop_livestock
world_fao_production.dictionary
world_fao_production.element
world_fao_production.item
world_fao_production.item_group
world_fao_production.production_indices
world_fao_production.value_agricultural_production
world_fifa_women_world_cup.matches
world_fifa_worldcup.award_winners
world_fifa_worldcup.matches
world_fifa_worldcup.players
world_fifa_worldcup.teams
world_fifa_worldcup.tournaments
world_gsps_consortium_gsps.global_indicators
world_oecd_pisa.dictionary
world_oecd_pisa.school_summary
world_oecd_pisa.student_summary
world_slave_voyages_consortium_slave_trade.transatlantic
world_spi_spi.global_indicators
world_ti_corruption_perception.country
world_wb_wwbi.country_finance
world_wb_wwbi.country_indicators
br_anatel_banda_larga_fixa.backhaul
br_anatel_banda_larga_fixa.pble
br_inep_ana.aluno
br_inep_ana.escola
br_inep_ana.prova
br_inep_censo_escolar.docente
br_inep_censo_escolar.matricula
br_inep_formacao_docente.brasil
br_inep_formacao_docente.escola
br_inep_formacao_docente.municipio
br_inep_formacao_docente.regiao
br_inep_formacao_docente.uf
br_inep_indicador_nivel_socioeconomico.brasil
br_inep_indicador_nivel_socioeconomico.municipio
br_inep_indicador_nivel_socioeconomico.uf
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
br_inep_indicadores_educacionais.fluxo_educacao_superior
br_inmet_bdmep.estacao