refactor: reorganize project structure and fix broken references

- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
commit ed5fa6756e (parent 02cb13362c)
2026-03-29 20:46:27 +02:00
43 changed files with 302366 additions and 1093 deletions

.gitignore (vendored, 3 changed lines)

@@ -3,6 +3,5 @@
 logs/
 done_tables.txt
 done_transfers.txt
-# CocoIndex Code (ccc)
-/.cocoindex_code/
 **/target
+*.log

Dockerfile

@@ -28,8 +28,9 @@ ENV PATH="/root/.cargo/bin:${PATH}" \
 WORKDIR /app
-COPY basedosdados.duckdb Caddyfile start.sh auth.py ask.py ./
-RUN chmod +x start.sh
+COPY data/basedosdados.duckdb shell/Caddyfile shell/auth.py start.sh ./
+COPY ask/ask /app/ask
+RUN chmod +x start.sh /app/ask
 EXPOSE 8080

README.md (127 changed lines)

@@ -8,11 +8,13 @@ Os dados foram exportados do BigQuery para o Hetzner Object Storage (Helsinki) n
 ## Consultando os dados
-Acesso via browser ou curl, protegido por senha. Peça a senha para o administrador.
+Acesso via browser ou curl, protegido por senha - peça!
 ### Shell no browser
-Acesse **https://db.xn--2dk.xyz** → autentique → shell DuckDB interativo direto no browser.
+Acesse **https://db..xyz** → autentique → shell DuckDB interativo direto no browser.
+Use `.tables` para listar os datasets.
 ### SQL via curl
@@ -46,35 +48,6 @@ curl -s -X POST https://db.xn--2dk.xyz/query \
   --data-binary @query.sql > resultado.csv
 ```
-### Descobrindo tabelas
-```sql
--- listar todos os datasets (schemas)
-SHOW SCHEMAS;
--- listar tabelas de um dataset
-SHOW TABLES IN br_anatel_banda_larga_fixa;
--- ver colunas de uma tabela
-DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
-```
-No shell do browser, `.tables` lista tudo de uma vez.
-### Exportar em CSV ou JSON
-O DuckDB permite formatar a saída diretamente na query:
-```sql
--- CSV com header (pipe para arquivo via curl)
-COPY (SELECT * FROM br_ibge_censo2022.municipios LIMIT 1000)
-TO '/dev/stdout' (FORMAT csv, HEADER true);
--- JSON
-SELECT * FROM br_ibge_censo2022.municipios LIMIT 10
-FORMAT JSON;
-```
 ---
 ## Exploração local
@@ -82,11 +55,11 @@ FORMAT JSON;
 Para rodar as queries na sua própria máquina com DuckDB instalado:
 ```bash
-python prepara_db.py    # gera basedosdados.duckdb com views apontando para o S3
-duckdb basedosdados.duckdb
+duckdb data/basedosdados.duckdb
 ```
 As queries são executadas diretamente sobre os arquivos Parquet no S3 — não há download de dados. O DuckDB lê os arquivos remotos sob demanda via `httpfs`.
+Precisa da credencial da .env - peça!
 ---
@@ -94,62 +67,52 @@ As queries são executadas diretamente sobre os arquivos Parquet no S3 — não
 Interface TUI que permite fazer perguntas em português e obter SQL automaticamente.
+### Arquitetura
+```
+Pergunta → [schema filtrado] → LLM local (sqlcoder) ou API externa
+         → SQL
+```
+1. **Schema filtrado**: As tabelas relevantes são filtradas e enviadas ao LLM
+2. **Geração SQL**: Modelo local (sqlcoder via Ollama) ou API externa (Gemini/OpenRouter)
 ### No browser
-Acesse **https://ask.xn--2dk.xyz** → autentique → digite sua pergunta em português.
+Acesse **https://ask..xyz** → autentique → digite sua pergunta em português.
 ### Local
 ```bash
+# Compilar
 cd ask
 cargo build --release
-./target/release/ask                                # modo interativo
-./target/release/ask "Quantos municípios tem SP?"   # modo CLI
+# Modo interativo (TUI)
+./target/release/ask
+# Modo CLI
+./target/release/ask "Quantos municípios tem SP?"
 ```
 ### Variáveis de ambiente
-| Variável | Descrição |
-|---|---|
-| `GEMINI_API_KEY` | Chave da API Gemini (obrigatória para usar modelos Gemini) |
-| `OPENROUTER_API_KEY` | Chave para usar modelos via OpenRouter |
-| `GEMINI_MODEL` | Modelo a usar (padrão: `gemini-flash-latest`) |
-| `SCHEMA_FILE` | Arquivo de schema (padrão: `context/schema_compact_inline.txt`) |
-| `DB_FILE` | Arquivo DuckDB (padrão: `basedosdados.duckdb`) |
+| Variável | Padrão | Descrição |
+|---|---|---|
+| `SQL_GENERATOR` | `gemini` | Generator: `sqlcoder`, `gemini`, ou `openrouter` |
+| `GEMINI_API_KEY` | — | Chave API Gemini (obrigatória se usar gemini) |
+| `OPENROUTER_API_KEY` | — | Chave API OpenRouter (obrigatória se usar openrouter) |
+| `GEMINI_MODEL` | `gemini-flash-latest` | Modelo Gemini |
+| `OPENROUTER_MODEL` | `openai/gpt-4o-mini` | Modelo OpenRouter |
+| `OLLAMA_MODEL` | `sqlcoder` | Modelo Ollama (sqlcoder ou sqlcoder:14b) |
+| `OLLAMA_HOST` | `http://localhost:11434` | Host Ollama |
+| `TOP_K_TABLES` | `5` | Número de tabelas a selecionar |
+| `SCHEMA_FILE` | `context/schema_compact_inline.txt` | Schema texto para fallback |
+| `SCHEMA_JSON` | `context/basedosdados-schema.json` | Schema JSON completo |
+| `DB_FILE` | `data/basedosdados.duckdb` | Arquivo DuckDB |
 ---
+## Arquivos de schema
+O diretório `context/` contém artefatos gerados automaticamente para contexto do LLM e descoberta de tabelas:
+| Arquivo | Descrição |
+|---|---|
+| `schema_compact_inline.txt` | Schema condensado para contexto do LLM |
+| `schema_compact.txt` | Schema mais verboso |
+| `schema_ddl.sql` | DDL das views DuckDB |
+| `join_graph.json` | Relacionamentos entre tabelas |
+| `file_tree.md` | Estrutura de arquivos no S3 com tamanhos |
+| `schemas.json` | Schema raw do BigQuery |
+---
+## Descobrindo tabelas
+```sql
+-- listar todos os datasets (schemas)
+SHOW SCHEMAS;
+-- listar tabelas de um dataset
+SHOW TABLES IN br_anatel_banda_larga_fixa;
+-- ver colunas de uma tabela
+DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
+```
+No shell do browser, `.tables` lista tudo de uma vez. Para descoberta programática, use os arquivos em `context/`.
+---
 ## Pipeline de exportação
@@ -172,8 +135,8 @@ Resume automático: se interrompido, basta rodar novamente.
 | Script | Função |
 |---|---|
-| `roda.sh` | Pipeline principal de exportação |
-| `prepara_db.py` | Gera `basedosdados.duckdb` com views para todas as tabelas |
+| `scripts/roda.sh` | Pipeline principal de exportação |
+| `scripts/prepara_db.py` | Gera `data/basedosdados.duckdb` com views para todas as tabelas |
 ### Configuração (`.env`)
@@ -196,10 +159,10 @@ Resume automático: se interrompido, basta rodar novamente.
 ### Executando
 ```bash
-chmod +x roda.sh
-./roda.sh --dry-run      # estima tamanho e custo
-./roda.sh                # execução local
-./roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
+chmod +x scripts/roda.sh
+./scripts/roda.sh --dry-run      # estima tamanho e custo
+./scripts/roda.sh                # execução local
+./scripts/roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
 ```
 Autenticação GCP necessária antes da primeira exportação:
@@ -219,8 +182,8 @@ Cria uma VM `e2-standard-4` Debian 12 em `us-central1-a`, copia o script e o `.e
 | `GCP_VM_NAME` | `bd-export-vm` | Nome da instância |
 | `GCP_VM_ZONE` | `us-central1-a` | Zona do Compute Engine |
-### Deploy do servidor
+### Deploy do servidor para serviços de db e ask
 ```bash
-haloy deploy
+haloy deploy -f shell/haloy.yml
 ```

ask.py (deleted, 129 lines)

@@ -1,129 +0,0 @@
#!/usr/bin/env python3
"""
ask.py — Send a Portuguese question to Gemini and get back SQL.
Usage:
python ask.py "Quantos pedidos foram feitos por cliente no último mês?"
python ask.py "Qual a taxa de mortalidade infantil por município em 2020?"
Env vars:
GEMINI_API_KEY — required
SCHEMA_FILE — path to DDL file (default: context/schema_compact_inline.txt)
GEMINI_MODEL — model slug (default: gemini-2.0-flash-latest)
"""
import os
import sys
import json
import requests
import duckdb
from dotenv import load_dotenv
load_dotenv()
SCHEMA_FILE = os.getenv("SCHEMA_FILE", "context/schema_compact_inline.txt")
MODEL = os.getenv("GEMINI_MODEL", "gemini-flash-latest")
DB_FILE = os.getenv("DB_FILE", "basedosdados.duckdb")
def load_schema(path: str) -> str:
with open(path, "r", encoding="utf-8") as f:
return f.read()
def ask(question: str) -> str:
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
sys.exit("Error: GEMINI_API_KEY not set")
schema_ddl = load_schema(SCHEMA_FILE)
system_prompt = (
"You are a SQL expert for Base dos Dados (basedosdados.org), "
"a Brazilian open data warehouse with tables accessed via DuckDB.\n\n"
"Rules:\n"
"- Use DuckDB syntax. Tables are referenced as dataset.table.\n"
"- Only use columns from the provided DDL — never invent column names.\n"
"- Add WHERE filters on ano, sigla_uf, or id_municipio whenever possible.\n"
"- Return ONLY the SQL query, no explanation, no markdown fences.\n\n"
f"Schema DDL:\n\n{schema_ddl}"
)
url = (
f"https://generativelanguage.googleapis.com/v1beta/models"
f"/{MODEL}:generateContent"
)
payload = {
"system_instruction": {
"parts": [{"text": system_prompt}]
},
"contents": [
{
"parts": [{"text": question}]
}
]
}
response = requests.post(
url,
headers={
"Content-Type": "application/json",
"X-goog-api-key": api_key,
},
data=json.dumps(payload),
timeout=300,
)
response.raise_for_status()
result = response.json()
return result["candidates"][0]["content"]["parts"][0]["text"].strip()
def main():
if len(sys.argv) < 2:
print(f"Usage: python {sys.argv[0]} \"<pergunta em português>\"", file=sys.stderr)
sys.exit(1)
question = " ".join(sys.argv[1:])
print(f"Question: {question}\n", file=sys.stderr)
print(f"Model: {MODEL}\n", file=sys.stderr)
sql = ask(question)
print(f"\n── SQL ──────────────────────────────────────────\n{sql}\n", file=sys.stderr)
con = duckdb.connect(DB_FILE, read_only=True)
rel = con.sql(sql)
# box mode: build borders from column names + data
cols = rel.columns
rows = rel.fetchall()
if not rows:
print("(no rows returned)")
return
col_widths = [len(c) for c in cols]
for row in rows:
for i, val in enumerate(row):
col_widths[i] = max(col_widths[i], len(str(val) if val is not None else "NULL"))
def bar(left, mid, right, fill=""):
return left + mid.join(fill * (w + 2) for w in col_widths) + right
header = "" + "".join(f" {c:{w}} " for c, w in zip(cols, col_widths)) + ""
print(bar("", "", ""))
print(header)
print(bar("", "", ""))
for row in rows:
vals = [str(v) if v is not None else "NULL" for v in row]
print("" + "".join(f" {v:{w}} " for v, w in zip(vals, col_widths)) + "")
print(bar("", "", ""))
print(f"\n{len(rows)} row(s)")
if __name__ == "__main__":
main()

ask/.dockerignore (new file, 1 line)

@@ -0,0 +1 @@
target

ask/Cargo.lock (generated, 1 changed line)

@@ -252,6 +252,7 @@ dependencies = [
  "duckdb",
  "ratatui",
  "reqwest",
+ "serde",
  "serde_json",
  "syntect",
  "tui-textarea",

ask/Cargo.toml

@@ -9,6 +9,7 @@ path = "src/main.rs"
 [dependencies]
 reqwest = { version = "0.12", features = ["blocking", "rustls-tls", "json"], default-features = false }
+serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 duckdb = { version = "1", features = ["bundled"] }
 dotenvy = "0.15"

ask/ask (new executable; binary file not shown)

ask/src/main.rs

@@ -1,4 +1,9 @@
+mod schema_filter;
+mod sql_generator;
+mod table_selector;
+
 use anyhow::{Context, Result};
+use chrono::Utc;
 use crossterm::{
     event::{
         DisableBracketedPaste, DisableMouseCapture, EnableBracketedPaste, EnableMouseCapture,
@@ -9,14 +14,12 @@ use crossterm::{
 };
 use duckdb::Connection;
 use ratatui::{
-    buffer::Buffer,
     layout::{Constraint, Direction, Layout, Rect},
     style::{Color, Modifier, Style},
     text::{Line, Span},
     widgets::{Block, Borders, Gauge, Paragraph, Row, Table, TableState, Wrap},
     Frame, Terminal,
 };
-use chrono::Utc;
 use serde_json::{json, Value};
 use std::{
     env, fs,
@@ -43,6 +46,10 @@ struct Config {
     schema: String,
     db_file: String,
     prompt_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 }

 enum Phase {
@@ -234,10 +241,23 @@ fn spawn_worker(
     model: String,
     prompt_file: String,
     db_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 ) -> mpsc::Receiver<WorkerMsg> {
     let (tx, rx) = mpsc::channel::<WorkerMsg>();
-    std::thread::spawn(
-        move || match ask_model(&question, &schema, &model, &prompt_file) {
+    std::thread::spawn(move || {
+        match ask_model_with_selection(
+            &question,
+            &schema,
+            &model,
+            &prompt_file,
+            use_table_selection,
+            &embeddings_file,
+            &schema_json,
+            similarity_threshold,
+        ) {
             Err(e) => {
                 let err = format!("{:#}", e);
                 log_question(&question, "", false, Some(&err));
@@ -257,8 +277,8 @@ fn spawn_worker(
                 }
             }
         }
-        },
-    );
+        }
+    });
     rx
 }
@@ -270,6 +290,10 @@ fn spawn_retry_worker(
     model: String,
     prompt_file: String,
     db_file: String,
+    use_table_selection: bool,
+    embeddings_file: String,
+    schema_json: String,
+    similarity_threshold: f32,
 ) -> mpsc::Receiver<WorkerMsg> {
     let retry_q = format!(
         "{}\n\nO SQL que você gerou falhou com este erro DuckDB:\n```\n{}\n```\n\n\
@@ -277,7 +301,17 @@ fn spawn_retry_worker(
         Corrija o SQL. Retorne APENAS o SQL corrigido, sem explicação.",
         question, error, failed_sql
     );
-    spawn_worker(retry_q, schema, model, prompt_file, db_file)
+    spawn_worker(
+        retry_q,
+        schema,
+        model,
+        prompt_file,
+        db_file,
+        use_table_selection,
+        embeddings_file,
+        schema_json,
+        similarity_threshold,
+    )
 }

 // ── event handling ────────────────────────────────────────────────────────────
@@ -327,6 +361,10 @@ impl App {
             self.config.model.clone(),
             self.config.prompt_file.clone(),
             self.config.db_file.clone(),
+            self.config.use_table_selection,
+            self.config.embeddings_file.clone(),
+            self.config.schema_json.clone(),
+            self.config.similarity_threshold,
         ));
     }
@@ -398,6 +436,10 @@ impl App {
             self.config.model.clone(),
             self.config.prompt_file.clone(),
             self.config.db_file.clone(),
+            self.config.use_table_selection,
+            self.config.embeddings_file.clone(),
+            self.config.schema_json.clone(),
+            self.config.similarity_threshold,
         ));
         self.last_sql.clear();
     } else {
@@ -723,7 +765,12 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let col_max_widths: Vec<usize> = (0..col_count) let col_max_widths: Vec<usize> = (0..col_count)
.map(|i| { .map(|i| {
let header_len = cols[i].len(); let header_len = cols[i].len();
let data_len = rows.iter().filter_map(|r| r.get(i)).map(|c| c.len()).max().unwrap_or(0); let data_len = rows
.iter()
.filter_map(|r| r.get(i))
.map(|c| c.len())
.max()
.unwrap_or(0);
(header_len.max(data_len)).max(min_col_width as usize) (header_len.max(data_len)).max(min_col_width as usize)
}) })
.collect(); .collect();
@@ -732,16 +779,24 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let use_wrap = total_needed > available_width as usize; let use_wrap = total_needed > available_width as usize;
if use_wrap { if use_wrap {
let wrap_width = (available_width as usize / col_count).max(min_col_width as usize); let wrap_width =
let header_lines: Vec<Line> = cols.iter() (available_width as usize / col_count).max(min_col_width as usize);
let header_lines: Vec<Line> = cols
.iter()
.enumerate() .enumerate()
.map(|(i, c)| { .map(|(i, c)| {
let wrapped = wrap_text(c, wrap_width); let wrapped = wrap_text(c, wrap_width);
Line::from(wrapped) let spans: Vec<Span> =
wrapped.into_iter().map(|s| Span::raw(s)).collect();
Line::from(spans)
}) })
.collect(); .collect();
let max_header_lines = header_lines.iter().map(|l| l.len()).max().unwrap_or(1); let max_header_lines = header_lines
.iter()
.map(|l| l.spans.len())
.max()
.unwrap_or(1);
let mut all_row_lines: Vec<Vec<Line>> = Vec::new(); let mut all_row_lines: Vec<Vec<Line>> = Vec::new();
for row in rows { for row in rows {
@@ -749,19 +804,19 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
.map(|i| { .map(|i| {
let cell = row.get(i).map(|s| s.as_str()).unwrap_or(""); let cell = row.get(i).map(|s| s.as_str()).unwrap_or("");
let wrapped = wrap_text(cell, wrap_width); let wrapped = wrap_text(cell, wrap_width);
Line::from(wrapped) let spans: Vec<Span> =
wrapped.into_iter().map(|s| Span::raw(s)).collect();
Line::from(spans)
}) })
.collect(); .collect();
let max_lines = row_lines.iter().map(|l| l.len()).max().unwrap_or(1); let max_lines = row_lines.iter().map(|l| l.spans.len()).max().unwrap_or(1);
all_row_lines.push(row_lines); all_row_lines.push(row_lines);
} }
let selected_idx = table_state.selected().unwrap_or(0); let selected_idx = table_state.selected().unwrap_or(0);
let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n); let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n);
let block = Block::default() let block = Block::default().borders(Borders::ALL).title(table_title);
.borders(Borders::ALL)
.title(table_title);
let area = chunks[2]; let area = chunks[2];
f.render_widget(block, area); f.render_widget(block, area);
@@ -778,29 +833,32 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
let start_row = if n > visible_rows as usize { let start_row = if n > visible_rows as usize {
let scroll = selected_idx as i32 - visible_rows as i32 / 2; let scroll = selected_idx as i32 - visible_rows as i32 / 2;
scroll.max(0) as usize.min(n.saturating_sub(visible_rows as usize)) (scroll.max(0) as usize).min(n.saturating_sub(visible_rows as usize))
} else { } else {
0 0
}; };
let header_bg = Style::default().fg(Color::Yellow).add_modifier(Modifier::BOLD); let header_bg = Style::default()
.fg(Color::Yellow)
.add_modifier(Modifier::BOLD);
for (col_idx, header_line) in header_lines.iter().enumerate() { for (col_idx, header_line) in header_lines.iter().enumerate() {
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1); let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
let col_width = wrap_width as u16; let col_width = wrap_width as u16;
for (line_idx, line) in header_line.iter().enumerate() { for (line_idx, span) in header_line.spans.iter().enumerate() {
let y = inner_area.y + line_idx as u16; let y = inner_area.y + line_idx as u16;
if y >= inner_area.y + inner_area.height { if y >= inner_area.y + inner_area.height {
break; break;
} }
let spans: Vec<Span> = line.spans.iter().map(|s| { let styled_span = Span::styled(span.content.clone(), header_bg);
Span::styled(s.content.clone(), header_bg) f.render_widget(
}).collect(); Paragraph::new(Line::from(styled_span)),
f.render_widget(Paragraph::new(Line::from(spans)), Rect { Rect {
x: col_x, x: col_x,
y, y,
width: col_width, width: col_width,
height: 1, height: 1,
}); },
);
} }
} }
@@ -811,7 +869,9 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
} }
let is_selected = row_idx == selected_idx; let is_selected = row_idx == selected_idx;
let row_style = if is_selected { let row_style = if is_selected {
Style::default().bg(Color::DarkGray).add_modifier(Modifier::BOLD) Style::default()
.bg(Color::DarkGray)
.add_modifier(Modifier::BOLD)
} else { } else {
Style::default() Style::default()
}; };
@@ -820,20 +880,21 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
for (col_idx, cell_lines) in row_lines.iter().enumerate() { for (col_idx, cell_lines) in row_lines.iter().enumerate() {
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1); let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
let col_width = wrap_width as u16; let col_width = wrap_width as u16;
for (line_idx, line) in cell_lines.iter().enumerate() { for (line_idx, span) in cell_lines.spans.iter().enumerate() {
let cell_y = y + line_idx as u16; let cell_y = y + line_idx as u16;
if cell_y >= inner_area.y + inner_area.height { if cell_y >= inner_area.y + inner_area.height {
break; break;
} }
let spans: Vec<Span> = line.spans.iter().map(|s| { let styled_span = Span::styled(span.content.clone(), row_style);
Span::styled(s.content.clone(), row_style) f.render_widget(
}).collect(); Paragraph::new(Line::from(styled_span)),
f.render_widget(Paragraph::new(Line::from(spans)), Rect { Rect {
x: col_x, x: col_x,
y: cell_y, y: cell_y,
width: col_width, width: col_width,
height: 1, height: 1,
}); },
);
} }
} }
@@ -850,7 +911,8 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
} }
} }
} else { } else {
let col_widths: Vec<Constraint> = cols.iter() let col_widths: Vec<Constraint> = cols
.iter()
.enumerate() .enumerate()
.map(|(i, _)| { .map(|(i, _)| {
let w = col_max_widths[i] as u16; let w = col_max_widths[i] as u16;
@@ -1008,6 +1070,55 @@ fn ask_model(question: &str, schema: &str, model: &str, prompt_file: &str) -> Re
    Ok(ensure_sql(&sql))
}
fn ask_model_with_selection(
question: &str,
_full_schema: &str,
model: &str,
prompt_file: &str,
use_selection: bool,
embeddings_file: &str,
schema_json: &str,
similarity_threshold: f32,
) -> Result<String> {
let prompt_template = fs::read_to_string(prompt_file)
.with_context(|| format!("Não foi possível ler o prompt: {}", prompt_file))?;
let (schema_to_use, selected_tables) = if use_selection {
match table_selector::select_tables_from_question(
question,
embeddings_file,
similarity_threshold,
) {
Ok(table_ids) => {
eprintln!(
"=> Selecionadas {} tables relevantes: {:?}",
table_ids.len(),
table_ids
);
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
let filtered_schema = schema_filter.filter_tables(&table_ids);
(filtered_schema, Some(table_ids))
}
Err(e) => {
eprintln!(
"=> Aviso: falha na seleção de tables ({}), usando schema completo",
e
);
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
(schema_filter.full_schema_text(), None)
}
}
} else {
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
(schema_filter.full_schema_text(), None)
};
let generator = sql_generator::create_sql_generator()?;
let sql = generator.generate(question, &schema_to_use, &prompt_template)?;
Ok(ensure_sql(&sql))
}
fn ask_gemini(question: &str, system_prompt: &str, model: &str) -> Result<String> {
    let key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY não definida")?;
    let url = format!(
@@ -1309,6 +1420,12 @@ VARIÁVEIS DE AMBIENTE
 OPENROUTER_API_KEY   necessária para modelos OpenRouter
 GEMINI_MODEL         modelo padrão (sobrescrito por --model)
 SCHEMA_FILE          DDL do schema [context/schema_compact_inline.txt]
+SCHEMA_JSON          full schema JSON [context/basedosdados-schema.json]
+EMBEDDINGS_FILE      table embeddings [context/table_embeddings.json]
+TOP_K_TABLES         número de tables a selecionar [5]
+SQL_GENERATOR        sql generator: sqlcoder|gemini|openrouter [gemini]
+OLLAMA_MODEL         modelo ollama [sqlcoder]
+OLLAMA_HOST          host ollama [http://localhost:11434]
 PROMPT_FILE          prompt do sistema [ask/system_prompt.md]
 DB_FILE              banco DuckDB [basedosdados.duckdb]
 "#
@@ -1321,7 +1438,18 @@ VARIÁVEIS DE AMBIENTE
     });
     let schema_file =
         env::var("SCHEMA_FILE").unwrap_or_else(|_| "context/schema_compact_inline.txt".into());
-    let db_file = env::var("DB_FILE").unwrap_or_else(|_| "basedosdados.duckdb".into());
+    let schema_json =
+        env::var("SCHEMA_JSON").unwrap_or_else(|_| "context/basedosdados-schema.json".into());
+    let embeddings_file =
+        env::var("EMBEDDINGS_FILE").unwrap_or_else(|_| "context/table_embeddings.json".into());
+    let similarity_threshold = env::var("SIMILARITY_THRESHOLD")
+        .ok()
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(0.35);
+    let use_table_selection = env::var("USE_TABLE_SELECTION")
+        .map(|v| v != "false" && v != "0")
+        .unwrap_or(true);
+    let db_file = env::var("DB_FILE").unwrap_or_else(|_| "data/basedosdados.duckdb".into());
     let prompt_file = env::var("PROMPT_FILE").unwrap_or_else(|_| "ask/system_prompt.md".into());
     let schema = fs::read_to_string(&schema_file)
         .with_context(|| format!("Não foi possível ler o schema: {}", schema_file))?;
@@ -1333,6 +1461,10 @@ VARIÁVEIS DE AMBIENTE
         schema,
         db_file,
         prompt_file,
+        use_table_selection,
+        embeddings_file,
+        schema_json,
+        similarity_threshold,
     });
 }
@@ -1341,7 +1473,16 @@ VARIÁVEIS DE AMBIENTE
     eprintln!("\nModel: {}\nPergunta: {}\n", model, question);
     let t0 = Instant::now();
-    let sql = ask_model(&question, &schema, &model, &prompt_file)?;
+    let sql = ask_model_with_selection(
+        &question,
+        &schema,
+        &model,
+        &prompt_file,
+        use_table_selection,
+        &embeddings_file,
+        &schema_json,
+        similarity_threshold,
+    )?;
     eprintln!("=> SQL gerado em {}", fmt_duration(t0.elapsed()));
     print_sql_box(&sql);

ask/src/schema_filter.rs (new file, 135 lines)

@@ -0,0 +1,135 @@
use serde::{Deserialize, Serialize};
use std::collections::HashSet;
use std::fs;
use std::path::Path;
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct Column {
pub name: String,
#[serde(rename = "type")]
pub col_type: String,
pub description: Option<String>,
}
pub type TableColumns = Vec<Column>;
#[derive(Debug, Clone, Deserialize)]
pub struct FullSchema {
#[serde(flatten)]
pub datasets:
std::collections::HashMap<String, std::collections::HashMap<String, TableColumns>>,
}
pub struct SchemaFilter {
schema: FullSchema,
}
impl SchemaFilter {
pub fn new<P: AsRef<Path>>(schema_path: P) -> anyhow::Result<Self> {
let content = fs::read_to_string(schema_path)?;
let schema: FullSchema = serde_json::from_str(&content)?;
Ok(Self { schema })
}
pub fn filter_tables(&self, table_ids: &[String]) -> String {
let selected: HashSet<String> = table_ids.iter().cloned().collect();
let mut lines = Vec::new();
lines.push("# Base dos Dados — Filtered Schema".to_string());
lines.push(
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
.to_string(),
);
lines.push("# Format: dataset.table: col:TYPE description".to_string());
lines.push(String::new());
for (dataset, tables) in &self.schema.datasets {
for (table, columns) in tables {
let full_id = format!("{}.{}", dataset, table);
if selected.contains(&full_id) {
let col_str = columns
.iter()
.map(|c| {
let desc = c.description.as_deref().unwrap_or("");
if desc.is_empty() {
format!("{}:{}", c.name, type_abbrev(&c.col_type))
} else {
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
}
})
.collect::<Vec<_>>()
.join(" ");
lines.push(format!("{}: {}", full_id, col_str));
}
}
}
lines.join("\n")
}
pub fn full_schema_text(&self) -> String {
let mut lines = Vec::new();
lines.push("# Base dos Dados — Full Schema".to_string());
lines.push(
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
.to_string(),
);
lines.push("# Format: dataset.table: col:TYPE description".to_string());
lines.push(String::new());
for (dataset, tables) in &self.schema.datasets {
for (table, columns) in tables {
let full_id = format!("{}.{}", dataset, table);
let col_str = columns
.iter()
.map(|c| {
let desc = c.description.as_deref().unwrap_or("");
if desc.is_empty() {
format!("{}:{}", c.name, type_abbrev(&c.col_type))
} else {
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
}
})
.collect::<Vec<_>>()
.join(" ");
lines.push(format!("{}: {}", full_id, col_str));
}
}
lines.join("\n")
}
pub fn dataset_count(&self) -> usize {
self.schema.datasets.len()
}
pub fn table_count(&self) -> usize {
self.schema.datasets.values().map(|t| t.len()).sum()
}
}
fn type_abbrev(full_type: &str) -> String {
let upper = full_type.to_uppercase();
if upper.contains("VARCHAR") || upper.contains("STRING") {
"V".to_string()
} else if upper.contains("INT") {
"I".to_string()
} else if upper.contains("DOUBLE") || upper.contains("FLOAT") {
"D".to_string()
} else if upper.contains("DATE") && !upper.contains("TIMESTAMP") {
"Dt".to_string()
} else if upper.contains("TIMESTAMP") {
"Ts".to_string()
} else if upper.contains("TIME") {
"Ti".to_string()
} else if upper.contains("BOOLEAN") {
"B".to_string()
} else if upper.contains("DECIMAL") {
"Dec".to_string()
} else {
full_type.to_string()
}
}

ask/src/sql_generator.rs (new file, 207 lines)

@@ -0,0 +1,207 @@
use anyhow::{Context, Result};
use serde_json::Value;
use std::env;
pub trait SqlGenerator: Send + Sync {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String>;
}
pub fn create_sql_generator() -> Result<Box<dyn SqlGenerator>> {
let generator_type = env::var("SQL_GENERATOR").unwrap_or_else(|_| "gemini".to_string());
match generator_type.as_str() {
"sqlcoder" => Ok(Box::new(SqlCoderGenerator::new()?)),
"openrouter" => Ok(Box::new(OpenRouterGenerator::new()?)),
"gemini" => Ok(Box::new(GeminiGenerator::new()?)),
_ => anyhow::bail!(
"Unknown SQL_GENERATOR: {}. Use: sqlcoder, gemini, or openrouter",
generator_type
),
}
}
pub struct GeminiGenerator {
model: String,
api_key: String,
}
impl GeminiGenerator {
pub fn new() -> Result<Self> {
let model = env::var("GEMINI_MODEL").unwrap_or_else(|_| "gemini-flash-latest".to_string());
let api_key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY not defined")?;
Ok(Self { model, api_key })
}
}
impl SqlGenerator for GeminiGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = format!(
"https://generativelanguage.googleapis.com/v1beta/models/{}:generateContent",
self.model
);
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
let payload = serde_json::json!({
"system_instruction": { "parts": [{ "text": system_prompt }] },
"contents": [{ "parts": [{ "text": question }] }]
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(&url)
.header("Content-Type", "application/json")
.header("X-goog-api-key", &self.api_key)
.json(&payload)
.send()
.context("Gemini HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse Gemini response")?;
if !status.is_success() {
anyhow::bail!("Gemini API error {}: {}", status, body);
}
let text = body["candidates"][0]["content"]["parts"][0]["text"]
.as_str()
.context("Unexpected Gemini response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
pub struct OpenRouterGenerator {
model: String,
api_key: String,
}
impl OpenRouterGenerator {
pub fn new() -> Result<Self> {
let model =
env::var("OPENROUTER_MODEL").unwrap_or_else(|_| "openai/gpt-4o-mini".to_string());
let api_key = env::var("OPENROUTER_API_KEY").context("OPENROUTER_API_KEY not defined")?;
Ok(Self { model, api_key })
}
}
impl SqlGenerator for OpenRouterGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = "https://openrouter.ai/api/v1/chat/completions";
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
let payload = serde_json::json!({
"model": self.model,
"messages": [
{ "role": "system", "content": system_prompt },
{ "role": "user", "content": question }
]
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(url)
.header("Content-Type", "application/json")
.header("Authorization", format!("Bearer {}", self.api_key))
.header("HTTP-Referer", "https://basedosdados.org")
.header("X-Title", "Base dos Dados Ask")
.json(&payload)
.send()
.context("OpenRouter HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse OpenRouter response")?;
if !status.is_success() {
anyhow::bail!("OpenRouter API error {}: {}", status, body);
}
let text = body["choices"][0]["message"]["content"]
.as_str()
.context("Unexpected OpenRouter response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
pub struct SqlCoderGenerator {
model: String,
host: String,
}
impl SqlCoderGenerator {
pub fn new() -> Result<Self> {
let model = env::var("OLLAMA_MODEL").unwrap_or_else(|_| "sqlcoder".to_string());
let host = env::var("OLLAMA_HOST").unwrap_or_else(|_| "http://localhost:11434".to_string());
Ok(Self { model, host })
}
}
impl SqlGenerator for SqlCoderGenerator {
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
let url = format!("{}/api/generate", self.host);
let full_prompt = format!(
"{}\n\nSchema DDL:\n\n{}\n\nQuestion: {}\n\nSQL:",
prompt_template.trim(),
schema,
question
);
let payload = serde_json::json!({
"model": self.model,
"prompt": full_prompt,
"stream": false
});
let client = reqwest::blocking::Client::builder()
.timeout(std::time::Duration::from_secs(300))
.build()?;
let resp = client
.post(&url)
.header("Content-Type", "application/json")
.json(&payload)
.send()
.context("Ollama HTTP request failed")?;
let status = resp.status();
let body: Value = resp.json().context("Failed to parse Ollama response")?;
if !status.is_success() {
anyhow::bail!("Ollama API error {}: {}", status, body);
}
let text = body["response"]
.as_str()
.context("Unexpected Ollama response format")?
.trim()
.to_string();
Ok(strip_fences(&text))
}
}
fn strip_fences(text: &str) -> String {
    let text = text.trim();
    if let Some(rest) = text.strip_prefix("```sql") {
        // search for the closing fence only after the opening one
        let end = rest.find("```").unwrap_or(rest.len());
        rest[..end].trim().to_string()
    } else if let Some(rest) = text.strip_prefix("```") {
        let end = rest.find("```").unwrap_or(rest.len());
        rest[..end].trim().to_string()
    } else {
        text.to_string()
    }
}

ask/src/table_selector.rs (new file, 146 lines)

@@ -0,0 +1,146 @@
use serde::{Deserialize, Serialize};
use std::fs;
use std::path::Path;
const DEFAULT_SIMILARITY_THRESHOLD: f32 = 0.35;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TableEmbedding {
pub id: String,
pub text: String,
pub embedding: Vec<f32>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbeddingsIndex {
pub tables: Vec<TableEmbedding>,
pub model: String,
}
pub struct TableSelector {
tables: Vec<TableEmbedding>,
threshold: f32,
}
impl TableSelector {
pub fn new<P: AsRef<Path>>(embeddings_path: P, threshold: f32) -> anyhow::Result<Self> {
let content = fs::read_to_string(embeddings_path)?;
let index: EmbeddingsIndex = serde_json::from_str(&content)?;
Ok(Self {
tables: index.tables,
threshold,
})
}
pub fn select_tables(
&self,
question: &str,
model: &dyn QuestionEmbedder,
) -> anyhow::Result<Vec<String>> {
let question_embedding = model.embed(question)?;
let mut similarities: Vec<(usize, f32)> = self
.tables
.iter()
.enumerate()
.map(|(i, table)| {
let sim = cosine_similarity(&question_embedding, &table.embedding);
(i, sim)
})
.collect();
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
let selected: Vec<String> = similarities
.into_iter()
.filter(|(_, sim)| *sim >= self.threshold)
.map(|(i, sim)| {
eprintln!(" {} (similarity: {:.3})", self.tables[i].id, sim);
self.tables[i].id.clone()
})
.collect();
Ok(selected)
}
pub fn get_table_texts(&self, table_ids: &[String]) -> Vec<String> {
table_ids
.iter()
.filter_map(|id| self.tables.iter().find(|t| &t.id == id))
.map(|t| t.text.clone())
.collect()
}
pub fn table_count(&self) -> usize {
self.tables.len()
}
}
pub trait QuestionEmbedder: Send + Sync {
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>>;
}
pub struct LocalEmbedder {
model_path: String,
}
impl LocalEmbedder {
pub fn new(model_path: String) -> Self {
Self { model_path }
}
}
impl QuestionEmbedder for LocalEmbedder {
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>> {
use std::process::Command;
let output = Command::new("python3")
.args([
"-c",
&format!(
r#"
import json
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('{}')
emb = model.encode('{}', convert_to_numpy=True)
print(json.dumps([float(x) for x in emb]))
"#,
self.model_path,
text.replace("'", "\\'")
),
])
.output()?;
if !output.status.success() {
let err = String::from_utf8_lossy(&output.stderr);
anyhow::bail!("Embedding generation failed: {}", err);
}
let output_str = String::from_utf8_lossy(&output.stdout);
let floats: Vec<f32> = serde_json::from_str(&output_str)?;
Ok(floats)
}
}
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
if norm_a == 0.0 || norm_b == 0.0 {
0.0
} else {
dot_product / (norm_a * norm_b)
}
}
pub fn select_tables_from_question(
question: &str,
embeddings_path: &str,
threshold: f32,
) -> anyhow::Result<Vec<String>> {
let selector = TableSelector::new(embeddings_path, threshold)?;
let embedder = LocalEmbedder::new("all-MiniLM-L6-v2".to_string());
selector.select_tables(question, &embedder)
}

ask/system_prompt.md

@@ -147,3 +147,68 @@ LIMIT 30
if the question requires tables not in the provided DDL, OR
If you cant generate a valid SQL,
answer as a JSON {error: "#{reason}"}
## Common SQL Pitfalls & Debugging Strategy
### 1. Column Propagation in CTEs (Most Common Error!)
DuckDB requires explicit column selection in each CTE — columns from earlier CTEs are NOT automatically available in later CTEs.
WRONG — `pop_2010` was not selected in `populacao` CTE:
```sql
WITH populacao AS (
SELECT id_municipio, sigla_uf -- forgot pop_2010
),
fluxo AS (
SELECT p.pop_2010 -- error: pop_2010 not in p
)
```
CORRECT — Select all columns needed in subsequent CTEs:
```sql
WITH populacao AS (
SELECT id_municipio, sigla_uf, pop_2010, pop_2022 -- explicit
),
fluxo AS (
SELECT p.pop_2010 -- works
)
```
### 2. ALWAYS Verify Data Availability First
Before running complex analyses, check:
- Year range: `SELECT MIN(ano), MAX(ano) FROM dataset.table`
- Record count: `SELECT COUNT(*) FROM dataset.table`
- ID format compatibility between tables before JOIN
### 3. Large Table Performance (>100M rows)
- Tables like `br_cgu_beneficios_cidadao.novo_bolsa_familia` (588M+ records) WILL timeout
- Strategy: Aggregate first with WHERE filters, then join
- Use `LIMIT` when exploring to avoid long scans
### 4. Lock Conflicts
Multiple concurrent DuckDB queries on the same `.duckdb` file cause lock errors.
- Wait between queries or use read-only mode
### 5. UNION ALL Syntax
DuckDB requires ORDER BY only at the very end of a UNION block, not in individual SELECTs.
WRONG:
```sql
SELECT ... LIMIT 5
ORDER BY x
UNION ALL
SELECT ... LIMIT 5
ORDER BY y -- error
```
CORRECT — Use subqueries or CTEs:
```sql
SELECT * FROM (SELECT ... ORDER BY x LIMIT 5) a
UNION ALL
SELECT * FROM (SELECT ... ORDER BY y LIMIT 5) b
```
### 6. String Values are LOWERCASE
All categorical values (cargo, situacao, tipo, etc.) are stored in lowercase.
Always use: `WHERE cargo = 'deputado federal'` not `'DEPUTADO FEDERAL'`

Binary file not shown.

File diff suppressed because one or more lines are too long

context/table_embeddings.json (new file, 298355 lines; diff suppressed: file too long)

data/basedosdados.duckdb (new binary file; not shown)

docs/dataset_embeds.md (new file, 59 lines)

@@ -0,0 +1,59 @@
## Goal
Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.
## Instructions
- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
- Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs
## Discoveries
- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
- **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
- **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL
- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
- **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)
## Accomplished
1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/
## Relevant files / directories
### Created/Modified
- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names
### Configuration (env vars)
- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
- `SIMILARITY_THRESHOLD` - 0.35 default
- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
- `EMBEDDINGS_FILE`, `SCHEMA_JSON`
## Next Steps
- Increase similarity threshold (try 0.45) to reduce table count
- Improve sqlcoder prompt for better SQL generation
- Add fallback to increase threshold if too many tables selected
- Consider keyword matching as backup if embeddings fail

docs/patterns-audit.md (new file, 299 lines)

@@ -0,0 +1,299 @@
# Pattern Audit — Robustness & False Positive Analysis
Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.
---
## US1 — Split Contracts Below Threshold (`split_contracts_below_threshold`)
### Legal basis
**Fracionamento de licitação** is prohibited by:
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.
### Threshold: year-dependent
| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |
For 2023 data, many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$ 57.912. Using a static R$ 17.600 for 2024+ data would miss the main fraud window (R$17k to R$57k per contract). **Fixed (iteration 7):** all three implementations compute the threshold from the query year.
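A minimal sketch of the check as described above (BigQuery-style SQL; the supplier column `cnpj_fornecedor` and the bare table name are placeholders, not the exact queries in `index.ts` / `scan-*.ts`):

```sql
-- Per supplier, ministry and month: >= 3 contracts, each below the year-dependent
-- threshold, with a combined value above it.
WITH contratos AS (
  SELECT
    cnpj_fornecedor,
    id_orgao_superior,
    FORMAT_DATE('%Y-%m', data_assinatura_contrato) AS mes,
    valor_inicial_compra,
    IF(EXTRACT(YEAR FROM data_assinatura_contrato) >= 2024, 57912, 17600) AS threshold
  FROM contrato_compra
  WHERE valor_inicial_compra > 0
    AND data_assinatura_contrato IS NOT NULL   -- avoids the spurious mes = NULL bucket
)
SELECT cnpj_fornecedor, id_orgao_superior, mes,
       COUNT(*) AS n_contratos,
       SUM(valor_inicial_compra) AS combined_value
FROM contratos
WHERE valor_inicial_compra < threshold         -- each contract individually below the limit
GROUP BY cnpj_fornecedor, id_orgao_superior, mes, threshold
HAVING COUNT(*) >= 3 AND SUM(valor_inicial_compra) > threshold;
```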
### False positive scenarios
1. **Legitimate multi-item purchasing**: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The `combined_value > threshold` guard reduces but doesn't eliminate this.
2. **Recurring service contracts**: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year — correctly NOT flagged (combined = R$18.000 > threshold, count ≥ 3 in first 3 months).
3. **Different sub-units**: The grouping uses `id_orgao_superior` (ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.
### Improvements applied
- None structural. Filter `valor_inicial_compra > 0` prevents division issues.
### Known data quality issues
- `data_assinatura_contrato` can be NULL for some older contracts. **`FORMAT_DATE` on NULL returns NULL — it does NOT exclude those rows.** Without a guard, all NULL-dated contracts from the same agency would be grouped together under a single `NULL` month bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now include `AND data_assinatura_contrato IS NOT NULL` in the WHERE clause.
- `valor_inicial_compra` vs `valor_final_compra`: we use `valor_inicial_compra` intentionally since splitting is defined by the contract as signed, not final.
### Improvements applied (iteration 5)
- Added `AND data_assinatura_contrato IS NOT NULL` to WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spurious `mes = NULL` bucket.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): `scan-all.ts` now includes `id_orgao_superior` in both SELECT and GROUP BY, matching `index.ts` and `scan-suspicious.ts`. Prevents theoretical merging of two distinct ministries sharing the same name.
---
## US2 — Contract Concentration (`contract_concentration`)
### Legal basis
No specific legal prohibition, but **TCU** and **CGU** audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.
### Thresholds
- **40% share**: empirical; above this, competition is functionally absent for that agency.
- **R$ 50.000 minimum agency total**: excludes micro-units (small local offices) where one purchase naturally dominates.
- **R$ 10.000 minimum supplier spend** (new, iteration 2): excludes trivial cases such as a company holding R$21k of a R$50k agency total (42%), where both numbers are small. The three cutoffs combine as in the sketch below.
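A minimal sketch of the concentration check (column names follow those cited in this audit; `cnpj_fornecedor` is a placeholder, and the production queries live in `index.ts` / `scan-*.ts`):

```sql
-- Share of one supplier in one ministry's total spend, with the 40% / R$50k / R$10k cutoffs.
WITH spend AS (
  SELECT id_orgao_superior, nome_orgao_superior, cnpj_fornecedor,
         SUM(valor_inicial_compra) AS supplier_spend
  FROM contrato_compra
  GROUP BY id_orgao_superior, nome_orgao_superior, cnpj_fornecedor
),
ministry_total AS (
  SELECT id_orgao_superior, nome_orgao_superior, SUM(supplier_spend) AS agency_total
  FROM spend
  GROUP BY id_orgao_superior, nome_orgao_superior
)
SELECT s.cnpj_fornecedor, s.nome_orgao_superior,
       s.supplier_spend, t.agency_total,
       s.supplier_spend / t.agency_total AS share
FROM spend s
JOIN ministry_total t USING (id_orgao_superior, nome_orgao_superior)
WHERE t.agency_total >= 50000                     -- minimum agency total
  AND s.supplier_spend >= 10000                   -- minimum supplier spend
  AND s.supplier_spend / t.agency_total > 0.40;   -- concentration share
```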
### False positive scenarios
1. **Specialized niches**: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
2. **Monopolistic markets**: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
3. **Framework agreements**: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.
### Improvements applied
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000` to batch query and `scan-suspicious.ts` (iteration 2).
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND` filter to `index.ts` `patternConcentration` HAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): `index.ts` HAVING clause now includes `supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND`.
✅ Fixed (iteration 9): `scan-all.ts` and `scan-suspicious.ts` now group by `(id_orgao_superior, nome_orgao_superior)` in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.
---
## US3 — Inexigibility Recurrence (`inexigibility_recurrence`)
### Legal basis
**Inexigibilidade de licitação** (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
- Reference: **TCU Acórdão 1.793/2011**: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.
### Threshold: 3 contracts per managing unit
- Below 3: could be two legitimate sole-source needs in the same year.
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.
### False positive scenarios
1. **Legitimate exclusive suppliers**: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
2. **Long-term technical partnerships**: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
3. **Artistic/cultural organizations**: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.
### Improvements applied (iteration 2)
- **Batch + scan-suspicious**: Now groups by `id_unidade_gestora` (ID) + `nome_unidade_gestora` (name). Previously grouped by name only, risking merger of distinct units sharing a common name.
- **Batch + scan-suspicious**: Added `valor_inicial_compra >= R$ 1.000` filter. Micro-value contracts (< R$1k) rarely represent real abuse.
### Improvements applied (iteration 4)
- **`index.ts`**: Added `AND valor_inicial_compra >= @min_value` to WHERE clause of `patternInexigibility`. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): all three implementations now filter `valor_inicial_compra >= R$ 1.000` and group by `id_unidade_gestora`.
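A minimal sketch of the recurrence check with both guards applied (the supplier column and the exact modality label are assumptions; implementations may differ in detail):

```sql
-- Inexigibilidade contracts per supplier, managing unit and year, with the R$1.000 floor.
SELECT
  cnpj_fornecedor,                                  -- placeholder column name
  id_unidade_gestora,
  nome_unidade_gestora,
  EXTRACT(YEAR FROM data_assinatura_contrato) AS ano,
  COUNT(*) AS n_contratos,
  SUM(valor_inicial_compra) AS total_value
FROM contrato_compra
WHERE LOWER(modalidade_licitacao) LIKE '%inexigibilidade%'   -- exact label is an assumption
  AND valor_inicial_compra >= 1000
  AND data_assinatura_contrato IS NOT NULL
GROUP BY cnpj_fornecedor, id_unidade_gestora, nome_unidade_gestora, ano
HAVING COUNT(*) >= 3;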
---
## US4 — Single Bidder (`single_bidder`)
### Legal basis
Not inherently illegal, but flagged by:
- **Open Contracting Partnership "73 Red Flags" (2024)**, Flag #1: "Only one bid received."
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.
### Threshold: 2 occurrences
- Intentionally low. Even one solo-bid win warrants investigation context. Two is the minimum pattern.
### False positive scenarios
1. **Specialized markets**: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
2. **Geographic isolation**: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
3. **Poorly timed notices**: Short bid windows or holiday periods reduce participation regardless of market structure.
### SQL robustness notes
- Per-CNPJ: uses `STARTS_WITH(REGEXP_REPLACE(...), @cnpj)` — this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface.
- Batch: uses `MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))` to extract the winner's CNPJ from the `auction_stats` CTE. The `LENGTH = 14` guard in the `IF` condition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows have `vencedor=true` for the same auction (data quality issue), `MAX` picks lexicographically last — acceptable for batch purposes.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): **batch now counts ALL participants** (CPF + CNPJ) for `total_bidders`, matching per-CNPJ behavior. Previously, `LENGTH = 14` excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The `LENGTH = 14` guard is now applied only inside the `winner_cnpj` extraction `IF()` condition — not to the overall participant count.
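A sketch of the `auction_stats` shape described above, with the `LENGTH = 14` guard applied only inside the winner extraction (the `licitacao_participante` column names `id_licitacao` and `documento` are assumptions):

```sql
WITH auction_stats AS (
  SELECT
    id_licitacao,
    COUNT(1) AS total_bidders,                 -- counts CPF and CNPJ participants alike
    MAX(IF(vencedor AND LENGTH(REGEXP_REPLACE(documento, r'\D', '')) = 14,
           SUBSTR(REGEXP_REPLACE(documento, r'\D', ''), 1, 8),
           NULL)) AS winner_cnpj               -- base-8 key, CNPJ winners only
  FROM licitacao_participante
  GROUP BY id_licitacao
)
SELECT winner_cnpj, COUNT(*) AS single_bidder_wins
FROM auction_stats
WHERE total_bidders = 1
  AND winner_cnpj IS NOT NULL
GROUP BY winner_cnpj
HAVING COUNT(*) >= 2;                          -- threshold: 2 occurrences
```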
---
## US5 — Always Winner (`always_winner`)
### Legal basis
Not illegal per se, but high win rates in competitive auctions indicate possible:
- Bid rigging (Lei 12.529/2011 art. 36, IV)
- Tailored specifications (Lei 14.133/2021 art. 9, I)
- Reference: **OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)**
### Thresholds
- **≥80% win rate** (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
- **≥10 competitive participations** — minimum sample for statistical significance. Aligns batch and per-CNPJ.
- **Competitive auctions only (≥2 bidders)** — critical to avoid overlap with US4.
### Critical fix applied (iteration 2)
**The per-CNPJ version was NOT filtering for competitive auctions before this iteration.** A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a `competitive_auctions` CTE that filters `COUNT(1) >= 2`.
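A sketch of the per-CNPJ variant with the `competitive_auctions` guard (column names `id_licitacao` and `documento` are assumptions; `@cnpj` follows the parameter style used elsewhere in this audit):

```sql
WITH competitive_auctions AS (
  SELECT id_licitacao
  FROM licitacao_participante
  GROUP BY id_licitacao
  HAVING COUNT(1) >= 2                -- only auctions with real competition
)
SELECT participations, wins, SAFE_DIVIDE(wins, participations) AS win_rate
FROM (
  SELECT COUNT(*) AS participations, COUNTIF(vencedor) AS wins
  FROM licitacao_participante p
  JOIN competitive_auctions USING (id_licitacao)
  WHERE STARTS_WITH(REGEXP_REPLACE(p.documento, r'\D', ''), @cnpj)
)
WHERE participations >= 10            -- minimum sample
  AND SAFE_DIVIDE(wins, participations) >= 0.80;
```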
### Win rate distribution note
The `licitacao_participante` dataset is **strongly bimodal**: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only **perfect-win companies** — intentionally strict. This is documented in the spec.
### Per-CNPJ vs batch consistency
✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.
---
## US6 — Amendment Inflation (`amendment_inflation`)
### Legal basis
**Lei 14.133/2021 art. 125 §1º**: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract **reached or exceeded its legal ceiling**.
### Threshold: 1.25× (25% above original)
- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).
### False positive scenarios
1. **Lawful exceptional amendments**: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
2. **Construction contracts**: Legal ceiling is 50% (not 25%). Our threshold of 1.25× flags construction contracts that are within the legal limit.
3. **Value adjustment clauses**: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
4. **Data entry errors**: Some `valor_final_compra` values are clearly data quality issues (e.g., 100× original).
### Improvements applied (iteration 3)
- **Cap `inflation_ratio` at 10×** (`AMENDMENT_MAX_INFLATION_RATIO = 10.0`): ratios above this threshold are almost certainly data entry errors (e.g., `valor_final_compra` entered in a different unit) and would distort `total_excess` reporting. Applied to all three implementations via `AND ... <= @max_ratio` filter in SQL. Applied in `index.ts`, `scan-all.ts`, `scan-suspicious.ts`.
### Schema verification: construction vs goods/services threshold
Lei 14.133/2021 art.125 §1º allows 50% amendments for engineering works vs 25% for goods/services.
**Column verified (schema dump):** `contrato_compra` has `id_modalidade_licitacao` (code) and `modalidade_licitacao` (name). However, these columns encode **bidding modality** (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no `tipo_contrato` or `categoria` column in the accessible schema.
### Improvements applied (iteration 8): construction keyword detection
All three implementations now apply `IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25)` to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.
**Keywords and rationale:**
| Keyword | Matches | Rationale |
|---------|---------|-----------|
| `obra` | obra, obras | General construction work |
| `constru` | construção, construir | Building/construction |
| `reform` | reforma, reformar, reformas | Renovation/remodeling |
| `engenhari` | engenharia, engenheiro | Engineering services |
| `paviment` | pavimentação, pavimento | Road/floor paving |
| `demoli` | demolição, demolir | Demolition |
**Known limitations:** The `objeto` field is free-text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (conservative false positive vs missed construction exemption).
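Combining the keyword-based ceiling with the 10× data-quality cap, a sketch of the flagging filter (the original-value column name `valor_inicial_compra` is an assumption; the rest follows the prose):

```sql
-- Sketch; valor_inicial_compra is an assumed column name for the original contract value.
SELECT
  cpf_cnpj_contratado,
  objeto,
  SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) AS inflation_ratio,
  REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                  r'obra|constru|reform|engenhari|paviment|demoli') AS is_construction
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE valor_inicial_compra > 0
  AND SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) <= 10.0   -- AMENDMENT_MAX_INFLATION_RATIO
  AND SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) >=
      IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                         r'obra|constru|reform|engenhari|paviment|demoli'),
         1.50, 1.25)                                                  -- per-contract legal ceiling
```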
### Improvements applied (iteration 9): constructionCount field
`AmendmentInflationFlag` now includes `constructionCount`: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."
### Per-CNPJ vs batch consistency
⚠️ Minor divergence (accepted): `index.ts` includes the aditivos CTE (`zeroAmendmentCount`) and `constructionCount` from `is_construction`. The batch scanners do NOT include these — `contrato_termo_aditivo` full scan is too expensive in batch, and `constructionCount` is per-row info not aggregable without the row-level data. Both fields are only available in the web UI's per-CNPJ output.
---
## US7 — Newborn Company (`newborn_company`)
### Legal basis
No specific prohibition, but:
- **Lei 14.133/2021 art. 68, I**: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of possible shell company.
### Thresholds
- **180 days** (6 months): practical minimum for legitimate operational readiness.
- **R$ 50.000 minimum contract value**: excludes training contracts and small acquisitions where new companies are common and low-risk.
### False positive scenarios
1. **Spinoffs and restructurings**: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
2. **Holding company structures**: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
3. **Startups in innovation programs**: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
4. **`data_inicio_atividade` from establishments**: The founding date comes from `br_me_cnpj.estabelecimentos`, not `empresas`. Branches opened after the headquarters can make an established company appear "newborn" in a specific municipality.
### Data quality note
`data_inicio_atividade` lives in `br_me_cnpj.estabelecimentos`, NOT `empresas`. The query uses `MIN(est.data_inicio_atividade)` across all establishments for the same `cnpj_basico` — this correctly picks the earliest known opening date, reducing false positives from branches opened after the original establishment.
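A minimal sketch of that founding-date lookup (partition filters as described above; the contract-side join and `data_assinatura` are illustrative assumptions):

```sql
-- Sketch; uses the 2023-12 partition of br_me_cnpj.estabelecimentos as noted above.
WITH founding AS (
  SELECT
    est.cnpj_basico,
    MIN(est.data_inicio_atividade) AS founded_on   -- earliest opening date across all branches
  FROM `basedosdados.br_me_cnpj.estabelecimentos` est
  WHERE est.ano = 2023 AND est.mes = 12
  GROUP BY est.cnpj_basico
)
SELECT f.cnpj_basico, f.founded_on
FROM founding f
-- A contract is flagged when it is signed within 180 days of founded_on
-- (e.g. DATE_DIFF(data_assinatura, f.founded_on, DAY) <= 180) and
-- valor_final_compra >= 50000.
```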
### Per-CNPJ vs batch consistency
✅ Equivalent. Both use `MIN(data_inicio_atividade)` across establishments with `ano=2023 AND mes=12`.
⚠️ **Known necessary full-table scan**: The `first_contract` CTE in `batchNewborn` (`scan-all.ts`) intentionally omits an `ano` filter on `contrato_compra`:
```sql
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
```
This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: *"did this company win its very first contract within 180 days of founding?"* Restricting to `ano = ANO` would miss the true first contract if it occurred in an earlier year — producing a false negative. The `founding` CTE correctly filters `e.ano = ANO AND est.ano = ANO AND est.mes = 12`. Only `first_contract` scans all years, but the `LENGTH = 14` CPF exclusion and `valor_final_compra >= R$ 50k` filter significantly reduce bytes scanned.
---
## US8 — Sudden Surge (`sudden_surge`)
### Legal basis
Not illegal, but flagged by:
- **UNODC "Guidebook on anti-corruption in public procurement" (2013)**: "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.
### Thresholds
- **5× YoY growth**: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- **R$ 1.000.000 minimum**: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- **4-year lookback**: captures context before the surge.
### False positive scenarios
1. **Post-restructuring recovery**: A company that was inactive for 2 years then resumed full operations would appear to surge.
2. **New framework agreements**: Being added to a large framework agreement in year N can produce apparent surge with no underlying change in the company.
3. **Government budget cycles**: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements) creating apparent surges.
### SQL robustness note
Both per-CNPJ and batch use `prev_v > 0` guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses `LAG` window function; per-CNPJ iterates over the history array client-side.
**Consecutive-year guard (iteration 6):** The spec says `value[year_N] / value[year_N-1]`. Without a guard, `LAG` compares any adjacent rows in sorted order — if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:
- `scan-all.ts`: added `LAG(ano)` alongside `LAG(v)` and `WHERE ano - prev_ano = 1`
- `index.ts`, `scan-suspicious.ts`: added `curr.ano - prev.ano === 1` to the JS loop condition
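A sketch of the batch-side guard (the `yearly` CTE stands in for the per-CNPJ, per-year aggregation; deriving `cnpj_basico` from `cpf_cnpj_contratado` is assumed, not the literal `scan-all.ts` query):

```sql
-- Sketch; thresholds follow the values documented above.
WITH yearly AS (
  SELECT
    SUBSTR(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', ''), 1, 8) AS cnpj_basico,
    ano,
    SUM(valor_final_compra) AS v
  FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
  WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
  GROUP BY cnpj_basico, ano
),
with_prev AS (
  SELECT
    cnpj_basico, ano, v,
    LAG(v)   OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_v,
    LAG(ano) OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_ano
  FROM yearly
)
SELECT cnpj_basico, ano, v, prev_v
FROM with_prev
WHERE prev_v > 0                   -- zero→nonzero handled by US7 instead
  AND ano - prev_ano = 1           -- consecutive-year guard
  AND v >= 1000000                 -- R$ 1M minimum
  AND SAFE_DIVIDE(v, prev_v) >= 5  -- 5× YoY growth
```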
**False-positive note (from the audit):** The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger, since the consecutive-year guard skips comparisons for companies that were dormant for ≥1 year.
The per-CNPJ implementation reports only the **first** qualifying surge year (breaks on first hit). If a company surged twice, only the earlier event is shown. This is conservative.
### Per-CNPJ vs batch consistency
✅ Equivalent. Batch uses SQL `LAG`; per-CNPJ uses JS loop. Both find the first qualifying year.
---
## Infrastructure Issues: Caching Bugs
### Bug 1: Cache Miss vs Stored Null (fixed iteration 6)
`cache.ts` `getCache` was returning `null` for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use `null` as their "nothing found" sentinel and check `cached !== undefined` to skip re-querying. With the old `getCache` returning `null` on miss, `null !== undefined` evaluated to `true`, causing the BigQuery query to be skipped permanently — US4–US8 would never execute on a CNPJ not yet in cache.
**Fix:** `getCache` now returns `undefined` on miss or expiry; returns `T` (including `null`) on a valid cache hit. The company-lookup caller that used `!== null` was updated to `!== undefined`.
### Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)
US1, US2, US3, and `runPatterns()` in `index.ts` used `if (cached) return cached` to check for cache hits. A plain truthiness check treats every falsy cached value as a miss and cannot reliably distinguish "cached an empty result" from "not cached at all", so a cached "no flags found" result (a real cache hit) was silently discarded, causing BigQuery to be re-queried on every subsequent call for clean CNPJs.
Affected: `patternSplitContracts`, `patternConcentration`, `patternInexigibility`, `runPatterns`.
**Fix:** changed all four to `if (cached !== undefined) return cached`. (US4–US8 already used this pattern since they cache `null` as "nothing found" — they were correct.)
---
## Cross-Pattern Issues
### Overlap between US4 and US5
- **Before iteration 2**: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- **After iteration 2**: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.
### Overlap between US7 and US8
- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive — both signals reinforce each other.
### CNPJ matching strategy
All patterns use `cnpj_basico` (8-digit root) as the joining key. This means **all branches (filiais)** of a legal entity are attributed to the same `cnpj_basico`. This can create false positives for large organizations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.
---
## Summary Table
| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---------|---------|------------|---------------|
| US1 Split | Medium — multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; **batch GROUP BY now includes id_orgao_superior** |
| US2 Concentration | Medium — specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; **falsy cache check fixed**; **all 3 now GROUP BY (id+name) — no ministry-name collision** |
| US3 Inexigibility | High — legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; **falsy cache check fixed** |
| US4 Single Bidder | Medium — specialized/remote markets | OCP 2024 Flag #1 | **cache.ts bug fixed** (getCache null-vs-undefined); **batch now counts all participants (CPF+CNPJ)** — consistent with per-CNPJ |
| US5 Always Winner | **Was HIGH** (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; **cache.ts bug fixed** |
| US6 Amendment | Medium — inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; **cache.ts bug fixed**; **construction keyword detection: 1.50× threshold for obras/etc.**; **constructionCount in UI flag** |
| US7 Newborn | High — spinoffs, restructurings | CGU 2021 guide | **cache.ts bug fixed** (was never querying BigQuery on cache miss) |
| US8 Surge | Medium — framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; **cache.ts bug fixed** |

(binary image added, 1.8 MiB, not shown)


@@ -0,0 +1,45 @@
#!/usr/bin/env python3
import json
import re
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
STOPWORDS = {'de', 'do', 'da', 'a', 'ou', 'em', 'e', 'o', 'que', 'das', 'dos', 'nos', 'nas', 'um', 'uma', 'para', 'com', 'não', 'uma', 'à', 'ao', 'os', 'as', 'se', 'na', 'no', 'de', 'do', 'da', 'é', 'ser', 'seu', 'sua', 'isso', 'the', 'of', 'and', 'in', 'to', 'is', 'for', 'on', 'with', 'at', 'by', 'from'}
with open('context/basedosdados-schema.json') as f:
schema = json.load(f)
words = []
for dataset, tables in schema.items():
for table, cols in tables.items():
for col in cols:
name = col.get('name', '').lower()
desc = col.get('description', '').lower()
if name and len(name) >= 3:
words.append(name)
if desc:
for w in desc.split():
w = re.sub(r'[^a-záàâãéèêíìîóòôõúùûç]', '', w)
if len(w) >= 3 and w not in STOPWORDS:
words.append(w)
word_freq = Counter(words)
wc = WordCloud(
width=1600,
height=800,
background_color='white',
max_words=200,
colormap='viridis',
min_font_size=8
).generate_from_frequencies(word_freq)
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_attributes.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_attributes.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))

docs/wordcloud_datasets.png Normal file (binary image, 1.3 MiB, not shown)


@@ -0,0 +1,33 @@
#!/usr/bin/env python3
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
with open('context/basedosdados-schema.json') as f:
schema = json.load(f)
dataset_names = []
for dataset in schema.keys():
parts = dataset.replace('br_', '').replace('mundo_', '').replace('eu_', '').split('_')
dataset_names.extend([p for p in parts if len(p) >= 3])
word_freq = Counter(dataset_names)
wc = WordCloud(
width=1600,
height=800,
background_color='white',
max_words=100,
colormap='plasma',
min_font_size=10
).generate_from_frequencies(word_freq)
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_datasets.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_datasets.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))


@@ -1,268 +0,0 @@
import os
import json
import sys
import pyarrow.parquet as pq
import s3fs
import boto3
import duckdb
from dotenv import load_dotenv
load_dotenv()
S3_ENDPOINT = os.environ["HETZNER_S3_ENDPOINT"]
S3_BUCKET = os.environ["HETZNER_S3_BUCKET"]
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
s3_host = S3_ENDPOINT.removeprefix("https://").removeprefix("http://")
# --- boto3 client (listing only, zero egress) ---
boto = boto3.client(
"s3",
endpoint_url=S3_ENDPOINT,
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY,
)
# --- s3fs filesystem (footer-only reads via pyarrow) ---
fs = s3fs.S3FileSystem(
client_kwargs={"endpoint_url": S3_ENDPOINT},
key=ACCESS_KEY,
secret=SECRET_KEY,
)
# ------------------------------------------------------------------ #
# Phase 1: File inventory via S3 List API (zero data egress)
# ------------------------------------------------------------------ #
print("Phase 1: listing S3 objects...")
paginator = boto.get_paginator("list_objects_v2")
inventory = {} # "dataset/table" -> {files: [...], total_size: int}
for page in paginator.paginate(Bucket=S3_BUCKET):
for obj in page.get("Contents", []):
key = obj["Key"]
if not key.endswith(".parquet"):
continue
parts = key.split("/")
if len(parts) < 3:
continue
dataset, table = parts[0], parts[1]
dt = f"{dataset}/{table}"
if dt not in inventory:
inventory[dt] = {"files": [], "total_size_bytes": 0}
inventory[dt]["files"].append(key)
inventory[dt]["total_size_bytes"] += obj["Size"]
print(f" Found {len(inventory)} tables across {S3_BUCKET}")
# ------------------------------------------------------------------ #
# Phase 2: Schema reads — footer only (~30 KB per table)
# ------------------------------------------------------------------ #
print("Phase 2: reading parquet footers...")
def fmt_size(b):
for unit in ("B", "KB", "MB", "GB", "TB"):
if b < 1024 or unit == "TB":
return f"{b:.1f} {unit}"
b /= 1024
def extract_col_descriptions(schema):
"""Try to pull per-column descriptions from Arrow metadata."""
descriptions = {}
meta = schema.metadata or {}
# BigQuery exports embed a JSON blob under b'pandas' with column_info
pandas_meta_raw = meta.get(b"pandas") or meta.get(b"pandas_metadata")
if pandas_meta_raw:
try:
pm = json.loads(pandas_meta_raw)
for col in pm.get("columns", []):
name = col.get("name")
desc = col.get("metadata", {}) or {}
if isinstance(desc, dict) and "description" in desc:
descriptions[name] = desc["description"]
except Exception:
pass
# Also try top-level b'description' or b'schema'
for key in (b"description", b"schema", b"BigQuery:description"):
val = meta.get(key)
if val:
try:
descriptions["__table__"] = val.decode("utf-8", errors="replace")
except Exception:
pass
return descriptions
schemas = {}
errors = []
for i, (dt, info) in enumerate(sorted(inventory.items())):
dataset, table = dt.split("/", 1)
first_file = info["files"][0]
s3_path = f"{S3_BUCKET}/{first_file}"
try:
schema = pq.read_schema(fs.open(s3_path))
col_descs = extract_col_descriptions(schema)
# Build raw metadata dict (decode bytes keys/values)
raw_meta = {}
if schema.metadata:
for k, v in schema.metadata.items():
try:
dk = k.decode("utf-8", errors="replace")
dv = v.decode("utf-8", errors="replace")
# Try to parse JSON values
try:
dv = json.loads(dv)
except Exception:
pass
raw_meta[dk] = dv
except Exception:
pass
columns = []
for field in schema:
col = {
"name": field.name,
"type": str(field.type),
"nullable": field.nullable,
}
if field.name in col_descs:
col["description"] = col_descs[field.name]
# Check field-level metadata
if field.metadata:
for k, v in field.metadata.items():
try:
dk = k.decode("utf-8", errors="replace")
dv = v.decode("utf-8", errors="replace")
if dk in ("description", "DESCRIPTION", "comment"):
col["description"] = dv
except Exception:
pass
columns.append(col)
schemas[f"{dataset}.{table}"] = {
"path": f"s3://{S3_BUCKET}/{dataset}/{table}/",
"file_count": len(info["files"]),
"total_size_bytes": info["total_size_bytes"],
"total_size_human": fmt_size(info["total_size_bytes"]),
"columns": columns,
"metadata": raw_meta,
}
print(f" [{i+1}/{len(inventory)}] ✓ {dataset}.{table} ({len(columns)} cols, {fmt_size(info['total_size_bytes'])})")
except Exception as e:
errors.append({"table": f"{dataset}.{table}", "error": str(e)})
print(f" [{i+1}/{len(inventory)}] ✗ {dataset}.{table}: {e}", file=sys.stderr)
# ------------------------------------------------------------------ #
# Phase 3: Enrich from br_bd_metadados.bigquery_tables (small table)
# ------------------------------------------------------------------ #
META_TABLE = "br_bd_metadados.bigquery_tables"
meta_dt = "br_bd_metadados/bigquery_tables"
if meta_dt in inventory:
print(f"Phase 3: enriching from {META_TABLE}...")
try:
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute(f"""
SET s3_endpoint='{s3_host}';
SET s3_access_key_id='{ACCESS_KEY}';
SET s3_secret_access_key='{SECRET_KEY}';
SET s3_url_style='path';
""")
meta_path = f"s3://{S3_BUCKET}/br_bd_metadados/bigquery_tables/*.parquet"
# Peek at available columns
available = [r[0] for r in con.execute(f"DESCRIBE SELECT * FROM '{meta_path}' LIMIT 1").fetchall()]
print(f" Metadata columns: {available}")
# Try to find dataset/table description columns
desc_col = next((c for c in available if "description" in c.lower()), None)
ds_col = next((c for c in available if c.lower() in ("dataset_id", "dataset", "schema_name")), None)
tbl_col = next((c for c in available if c.lower() in ("table_id", "table_name", "table")), None)
if desc_col and ds_col and tbl_col:
rows = con.execute(f"""
SELECT {ds_col}, {tbl_col}, {desc_col}
FROM '{meta_path}'
""").fetchall()
for ds, tbl, desc in rows:
key = f"{ds}.{tbl}"
if key in schemas and desc:
schemas[key]["table_description"] = desc
print(f" Enriched {len(rows)} table descriptions")
else:
print(f" Could not find expected columns (dataset_id, table_id, description) — skipping enrichment")
con.close()
except Exception as e:
print(f" Enrichment failed: {e}", file=sys.stderr)
else:
print("Phase 3: br_bd_metadados.bigquery_tables not in S3 — skipping enrichment")
# ------------------------------------------------------------------ #
# Phase 4a: Write schemas.json
# ------------------------------------------------------------------ #
print("Phase 4: writing outputs...")
output = {
"_meta": {
"bucket": S3_BUCKET,
"total_tables": len(schemas),
"total_size_bytes": sum(v["total_size_bytes"] for v in schemas.values()),
"total_size_human": fmt_size(sum(v["total_size_bytes"] for v in schemas.values())),
"errors": errors,
},
"tables": dict(sorted(schemas.items())),
}
with open("schemas.json", "w", encoding="utf-8") as f:
json.dump(output, f, ensure_ascii=False, indent=2)
print(f" ✓ schemas.json ({len(schemas)} tables)")
# ------------------------------------------------------------------ #
# Phase 4b: Write file_tree.md
# ------------------------------------------------------------------ #
lines = [
f"# S3 File Tree: {S3_BUCKET}",
"",
]
# Group by dataset
datasets_map = {}
for dt_key, info in sorted(inventory.items()):
dataset, table = dt_key.split("/", 1)
datasets_map.setdefault(dataset, []).append((table, info))
total_files = sum(len(v["files"]) for v in inventory.values())
total_bytes = sum(v["total_size_bytes"] for v in inventory.values())
for dataset, tables in sorted(datasets_map.items()):
ds_bytes = sum(i["total_size_bytes"] for _, i in tables)
ds_files = sum(len(i["files"]) for _, i in tables)
lines.append(f"## {dataset}/ ({len(tables)} tables, {fmt_size(ds_bytes)}, {ds_files} files)")
lines.append("")
for table, info in sorted(tables):
schema_entry = schemas.get(f"{dataset}.{table}", {})
ncols = len(schema_entry.get("columns", []))
col_str = f", {ncols} cols" if ncols else ""
table_desc = schema_entry.get("table_description", "")
desc_str = f"{table_desc}" if table_desc else ""
lines.append(f" - **{table}/** ({len(info['files'])} files, {fmt_size(info['total_size_bytes'])}{col_str}){desc_str}")
lines.append("")
lines += [
"---",
f"**Total: {len(inventory)} tables · {fmt_size(total_bytes)} · {total_files} parquet files**",
]
with open("file_tree.md", "w", encoding="utf-8") as f:
f.write("\n".join(lines) + "\n")
print(f" ✓ file_tree.md ({len(inventory)} tables)")
print()
print("Done!")
print(f" schemas.json — full column-level schema dump")
print(f" file_tree.md — bucket tree with sizes")
if errors:
print(f" {len(errors)} tables failed (see schemas.json _meta.errors)")


@@ -1,4 +0,0 @@
duckdb
boto3
python-dotenv
openai

scripts/build_ask.sh Executable file (+42 lines)

@@ -0,0 +1,42 @@
#!/bin/bash
set -e
cd "$(dirname "$0")"
echo "=== Building ask binary for Linux x86_64 ==="
echo "Using Debian x86_64 container for native build..."
# Build in an x86_64 Debian container - this gives us a real x86_64 environment
# so we can build natively without cross-compilation complexity
# Use ask/ as context to avoid .dockerignore excluding src/
docker build \
--platform linux/amd64 \
-t ask-builder \
--build-arg BUILDKIT_INLINE_CACHE=1 \
-f - ask/ <<'EOF'
FROM rust:1.85-slim
RUN apt-get update -qq && \
apt-get install -y --no-install-recommends \
build-essential pkg-config libssl-dev && \
apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY . ./
RUN cargo build --release --locked
FROM scratch
COPY --from=0 /build/target/release/ask /ask
EOF
echo "=== Extracting binary ==="
# Extract the binary from the container
docker run --rm --platform linux/amd64 ask-builder cat /ask > ./ask/target/release/ask
# Make it executable
chmod +x ./ask/target/release/ask
echo "=== Binary built successfully ==="
file ./ask/target/release/ask
ls -lh ./ask/target/release/ask


@@ -62,7 +62,8 @@ if $GCLOUD_RUN; then
exit 1
fi
done
-else
+elif ! $SYNC_RUN; then
+# Only require heavy GCP tools for the main export (not for --sync)
for cmd in bq gcloud gsutil parallel rclone flock; do
if ! command -v "$cmd" &>/dev/null; then
log_err "'$cmd' not found. Install google-cloud-sdk, GNU parallel, and rclone."
@@ -164,8 +165,8 @@ echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.clou
| sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list >/dev/null
sudo apt-get update -qq
sudo apt-get install -y google-cloud-cli
chmod +x ~/roda.sh
echo "Dependencies installed."
REMOTE_SETUP
log " Dependencies ready."
@@ -197,6 +198,121 @@ REMOTE_SETUP
exit 0
fi
# =============================================================================
# VM EXPORT — use existing bd-export-vm to export specific tables to GCS → S3
# =============================================================================
if [[ "${1:-}" == "--vm-export" ]]; then
VM_NAME="${GCP_VM_NAME:-bd-export-vm}"
VM_ZONE="${GCP_VM_ZONE:-us-central1-a}"
VM_PROJECT="${GCP_VM_PROJECT:-raspa-491716}"
TABLE_LIST="${2:-missing_tables.txt}"
log "=============================="
log " VM EXPORT MODE"
log " VM: $VM_NAME ($VM_ZONE)"
log " Tables: $TABLE_LIST"
log "=============================="
if [[ ! -f "$TABLE_LIST" ]]; then
log_err "Table list not found: $TABLE_LIST"
exit 1
fi
log "[1/5] Syncing files to VM..."
gcloud compute scp \
"$(dirname "$0")/roda.sh" \
"$(dirname "$0")/.env" \
"$(realpath "$TABLE_LIST")" \
"$VM_NAME:~/" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT"
log "[2/5] Ensuring GCS bucket exists..."
if ! gsutil ls "gs://$BUCKET_NAME" &>/dev/null; then
gsutil mb -p "$VM_PROJECT" -l "$BUCKET_REGION" -b on "gs://$BUCKET_NAME"
log " Bucket created: gs://$BUCKET_NAME"
else
log " Bucket already exists."
fi
log "[3/5] Running export on VM (bq extract + rclone)..."
gcloud compute ssh "$VM_NAME" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT" \
--command="bash -s" <<'REMOTE_EXPORT'
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive
cd ~
set -a
# shellcheck source=.env
source .env
set +a
source ~/.bashrc 2>/dev/null || true
export RCLONE_CONFIG_BD_TYPE="google cloud storage"
export RCLONE_CONFIG_BD_BUCKET_POLICY_ONLY="true"
export RCLONE_CONFIG_HZ_TYPE="s3"
export RCLONE_CONFIG_HZ_PROVIDER="Other"
export RCLONE_CONFIG_HZ_ENDPOINT="$HETZNER_S3_ENDPOINT"
export RCLONE_CONFIG_HZ_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID"
export RCLONE_CONFIG_HZ_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"
echo "[BQ EXTRACT] Starting export of missing tables..."
extract_table() {
local table="$1"
local dataset table_id gcs_prefix
dataset=$(echo "$table" | cut -d. -f1)
table_id=$(echo "$table" | cut -d. -f2)
gcs_prefix="gs://$BUCKET_NAME/$dataset/$table_id"
echo "[EXTRACT] $table"
bq extract \
--project_id="$YOUR_PROJECT" \
--destination_format=PARQUET \
--compression=ZSTD \
--location=US \
"${SOURCE_PROJECT}:${dataset}.${table_id}" \
"${gcs_prefix}/*.parquet" 2>&1 \
|| echo "[FAIL] $table"
}
export -f extract_table
export BUCKET_NAME SOURCE_PROJECT
cat missing_tables.txt | parallel -j8 --bar extract_table {}
echo "[TRANSFER] GCS → Hetzner S3..."
datasets=$(gsutil ls "gs://$BUCKET_NAME/" 2>/dev/null | sed 's|gs://[^/]*/||;s|/$||' | grep -v '^$' | sort -u)
for ds in $datasets; do
echo "[TRANSFER] $ds"
rclone copy "bd:$BUCKET_NAME/$ds/" "hz:$HETZNER_S3_BUCKET/$ds/" \
--transfers 32 --s3-upload-concurrency 32 --progress 2>&1 \
|| echo "[FAIL_TRANSFER] $ds"
done
echo "[DONE] Export complete."
REMOTE_EXPORT
log "[4/5] Verifying transfer..."
S3_COUNT=$(gcloud compute ssh "$VM_NAME" \
--zone="$VM_ZONE" \
--project="$VM_PROJECT" \
--command="source .env && rclone ls hz:\$HETZNER_S3_BUCKET 2>/dev/null | grep -c '\.parquet\$' || echo 0" 2>/dev/null)
log " S3 parquet files: $S3_COUNT"
log "[5/5] Cleaning up GCS bucket..."
read -rp "Delete GCS bucket gs://$BUCKET_NAME? [y/N] " confirm
if [[ "$confirm" =~ ^[Yy]$ ]]; then
gsutil -m rm -r "gs://$BUCKET_NAME"
gsutil rb "gs://$BUCKET_NAME"
log " Bucket deleted."
fi
log "VM export complete."
exit 0
fi
# =============================================================================
# SYNC — BigQuery → S3 direct (no GCS intermediary)
# =============================================================================


@@ -19,13 +19,13 @@ SQL
chmod 600 /app/ssh_init.sql
echo "[start] Starting ttyd terminal (db)..."
-ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/basedosdados.duckdb &
+ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/data/basedosdados.duckdb &
echo "[start] Starting ttyd terminal (ask)..."
-ttyd --port 7682 --writable python3 /app/ask.py &
+ttyd --port 7682 --writable /app/ask &
echo "[start] Starting auth service..."
-python3 /app/auth.py &
+python3 /app/shell/auth.py &
echo "[start] Starting Caddy..."
exec caddy run --config /app/Caddyfile --adapter caddyfile


@@ -1,543 +0,0 @@
#!/usr/bin/env python3
"""
sync_bq_to_local.py
Syncs missing tables from BigQuery (basedosdados project) to Hetzner S3,
then registers them as DuckDB views.
Usage:
python3 sync_bq_to_local.py # full sync
python3 sync_bq_to_local.py --dry-run # list missing tables only
python3 sync_bq_to_local.py --resume # resume from last run
Prerequisites:
gcloud auth application-default login
GCP project with billing enabled (free tier: 1 TB/month)
Environment (.env):
GCP_PROJECT - GCP project ID for billing
HETZNER_S3_BUCKET - S3 bucket name
HETZNER_S3_ENDPOINT - S3 endpoint URL
AWS_ACCESS_KEY_ID - S3 access key
AWS_SECRET_ACCESS_KEY - S3 secret key
"""
import os
import sys
import json
import argparse
import logging
import subprocess
from datetime import datetime
from pathlib import Path
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
import boto3
from botocore.config import Config as BotoConfig
from google.cloud import bigquery
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
LOG_FILE = f"sync_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[
logging.FileHandler(LOG_FILE),
logging.StreamHandler(sys.stdout),
],
)
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
SOURCE_PROJECT = "basedosdados"
MISSING_TABLES_FILE = "tasks/datasets_to_scrap.md"
DONE_FILE = "done_sync.txt"
FAILED_FILE = "failed_sync.txt"
DATA_DIR = "data"
PARQUET_DIR = "parquet"
MAX_RETRIES = 3
BATCH_SIZE = 1 # export one table at a time to manage memory
WORKERS = 4 # parallel uploads
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def load_env():
"""Load required environment variables."""
from dotenv import load_dotenv
load_dotenv()
required = [
"GCP_PROJECT",
"HETZNER_S3_BUCKET",
"HETZNER_S3_ENDPOINT",
"AWS_ACCESS_KEY_ID",
"AWS_SECRET_ACCESS_KEY",
]
missing = [v for v in required if not os.environ.get(v)]
if missing:
log.error("Missing env vars: %s", missing)
sys.exit(1)
return {v: os.environ[v] for v in required}
def get_s3_client(env):
"""Create boto3 S3 client configured for Hetzner."""
return boto3.client(
"s3",
endpoint_url=env["HETZNER_S3_ENDPOINT"],
aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
config=BotoConfig(s3={"addressing_style": "path"}),
)
def get_bq_client():
"""Create BigQuery client using Application Default Credentials."""
try:
os.environ["GOOGLE_CLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
os.environ["GCLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
client = bigquery.Client(project=os.environ.get("GCP_PROJECT", ""))
# Test the connection
list(client.list_datasets(max_results=1))
return client
except Exception as e:
log.error("BigQuery auth failed: %s", e)
log.error("")
log.error("Run these commands to authenticate:")
log.error(" gcloud auth login")
log.error(" gcloud auth application-default login")
log.error(" gcloud config set project %s", os.environ.get("GCP_PROJECT", ""))
log.error("")
log.error("The free tier (1 TB/month) is sufficient — no credit card needed.")
sys.exit(1)
def list_bq_tables(bq_client):
"""List all tables in the basedosdados BigQuery project."""
log.info("Discovering tables in BigQuery project: %s", SOURCE_PROJECT)
tables = {}
try:
datasets = list(bq_client.list_datasets())
log.info("Found %d datasets", len(datasets))
except Exception as e:
log.error("Failed to list datasets: %s", e)
sys.exit(1)
for dataset in datasets:
try:
tables_list = list(
bq_client.list_tables(
f"{SOURCE_PROJECT}.{dataset.dataset_id}",
max_results=10000,
)
)
for t in tables_list:
tables[f"{dataset.dataset_id}.{t.table_id}"] = {
"dataset": dataset.dataset_id,
"table": t.table_id,
"full_id": f"{SOURCE_PROJECT}.{dataset.dataset_id}.{t.table_id}",
"schema": [f.name for f in t.schema] if t.schema else [],
"num_bytes": t.num_bytes,
"num_rows": t.num_rows,
}
except Exception as e:
log.warning("Failed to list tables in dataset %s: %s", dataset.dataset_id, e)
log.info("Total BigQuery tables discovered: %d", len(tables))
return tables
def list_s3_tables(s3_client, bucket):
"""List datasets/tables already exported to S3."""
log.info("Discovering tables already in S3 bucket: %s", bucket)
table_files = defaultdict(lambda: defaultdict(list))
try:
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
for obj in page.get("Contents", []):
key = obj["Key"]
if not key.endswith(".parquet"):
continue
parts = key.split("/")
if len(parts) >= 3:
dataset, table = parts[0], parts[1]
table_files[dataset][table].append(key)
except Exception as e:
log.warning("S3 listing error (may be empty bucket): %s", e)
tables = {}
for dataset, t_dict in table_files.items():
for table, files in t_dict.items():
tables[f"{dataset}.{table}"] = files
log.info("Total S3 tables discovered: %d", len(tables))
return tables
def parse_missing_tables_from_md(filepath):
"""Parse the missing tables from tasks/datasets_to_scrap.md.
Returns a dict mapping 'dataset.table' -> description.
Falls back to None (use all non-S3 tables) if file not found.
"""
if not os.path.exists(filepath):
log.warning("Missing file %s, using all non-S3 tables", filepath)
return None
log.info("Parsing missing tables from %s", filepath)
with open(filepath) as f:
content = f.read()
missing = {}
lines = content.split("\n")
i = 0
def next_nonempty(lines, i):
while i < len(lines) and not lines[i].strip():
i += 1
return i
while i < len(lines):
line = lines[i].strip()
# Find the Basedosdados.org section
if "Basedosdados.org" in line and "Not in basedosdados.duckdb" in line:
log.info("Found Basedosdados.org section at line %d", i + 1)
i += 1
break
i += 1
# Now parse table entries
while i < len(lines):
line = lines[i].strip()
# End of section only on top-level ## headers, not ### subsections
if line.startswith("## "):
break
# Skip separators and empty lines
if not line or line.startswith("---") or "|---" in line:
i += 1
continue
# Find rows with backtick-wrapped dataset names (e.g. | `br_abrinq_oca` | ...)
if "`" in line and "|" in line:
# Split by pipe, strip whitespace and backticks
parts = [p.strip().strip("`").strip() for p in line.split("|")]
# Filter empty parts
parts = [p for p in parts if p]
if len(parts) >= 2:
dataset_raw = parts[0]
# Check if it looks like a dataset name (br_*, eu_*, mundo_*, etc.)
is_dataset = any(
dataset_raw.startswith(prefix)
for prefix in ("br_", "eu_", "mundo_", "nl_", "world_")
)
if is_dataset:
# parts[1] contains the missing table names (comma-separated)
tables_raw = parts[1]
for tbl in tables_raw.split(","):
tbl = tbl.strip()
# Clean up: remove parenthetical notes, trailing text
if "(" in tbl:
tbl = tbl.split("(")[0].strip()
if tbl and not tbl.startswith("-"):
missing[f"{dataset_raw}.{tbl}"] = f"from {filepath}"
i += 1
log.info("Parsed %d missing table references from MD", len(missing))
return missing if missing else None
def compute_missing_tables(bq_tables, s3_tables, md_missing):
"""Compute which tables need to be synced."""
if md_missing is None:
log.info("No MD file, computing diff: BQ - S3")
return [
(table_id, info)
for table_id, info in bq_tables.items()
if table_id not in s3_tables
]
log.info("Computing sync targets: MD missing tables not in S3")
targets = []
for key, info in bq_tables.items():
if key in s3_tables:
continue
if key in md_missing:
targets.append((key, info))
else:
# Table not in S3 but not in MD missing list
# Check if its dataset is partially covered
dataset = info["dataset"]
table = info["table"]
# If any table from this dataset is in MD missing, include it
dataset_in_md = any(
k.startswith(f"{dataset}.") and k.split(".", 1)[1] in md_missing
for k in bq_tables
)
if not dataset_in_md:
targets.append((key, info))
return targets
def estimate_size_mb(num_bytes):
"""Estimate size in MB."""
if num_bytes is None:
return "?"
return f"{num_bytes / 1_048_576:.1f}"
# ---------------------------------------------------------------------------
# Export logic
# ---------------------------------------------------------------------------
def sync_table(args, table_id, info, dry_run=False):
"""Sync a single table: BQ → parquet → S3 → DuckDB view."""
bq_client, s3_client, bucket = args
dataset = info["dataset"]
table = info["table"]
full_id = info["full_id"]
s3_key_prefix = f"{dataset}/{table}"
if dry_run:
size_mb = estimate_size_mb(info.get("num_bytes"))
return True, f"[DRY] {dataset}.{table} (~{size_mb} MB)"
# Step 1: Query from BigQuery
log.info("Querying %s from BigQuery", full_id)
query = f"SELECT * FROM `{full_id}`"
try:
query_job = bq_client.query(query, location="US")
df = query_job.to_dataframe()
except Exception as e:
return False, f"BQ query failed for {table_id}: {e}"
if df.empty:
return True, f"[SKIP] {table_id} — empty table"
if df.shape[0] > 10_000_000:
log.warning("Table %s has %d rows — may be slow/memory-intensive", table_id, df.shape[0])
# Step 2: Write to parquet in memory, then upload
import io
import pyarrow as pa
import pyarrow.parquet as pq
buffer = io.BytesIO()
table_pa = pa.Table.from_pandas(df)
# Write with zstd compression
writer = pq.ParquetWriter(
buffer,
table_pa.schema,
compression="zstd",
use_dictionary=True,
)
writer.write_table(table_pa)
writer.close()
buffer.seek(0)
s3_key = f"{s3_key_prefix}/{table}.parquet"
log.info("Uploading %s → s3://%s/%s (%s, %d rows)",
table_id, bucket, s3_key,
f"{buffer.getbuffer().nbytes / 1_048_576:.1f} MB",
df.shape[0])
try:
s3_client.upload_fileobj(
buffer,
bucket,
s3_key,
ExtraArgs={"ContentType": "application/octet-stream"},
)
except Exception as e:
return False, f"S3 upload failed for {table_id}: {e}"
log.info("[DONE] %s uploaded to s3://%s/%s", table_id, bucket, s3_key)
return True, f"[DONE] {table_id}"
def update_duckdb_view(env, table_id, info):
"""Register a new table as a DuckDB view over S3 parquet."""
import duckdb
dataset = info["dataset"]
table = info["table"]
bucket = env["HETZNER_S3_BUCKET"]
endpoint = env["HETZNER_S3_ENDPOINT"].removeprefix("https://").removeprefix("http://")
access_key = env["AWS_ACCESS_KEY_ID"]
secret_key = env["AWS_SECRET_ACCESS_KEY"]
# S3 path
s3_path = f"s3://{bucket}/{dataset}/{table}/{table}.parquet"
try:
con = duckdb.connect("basedosdados.duckdb", read_only=False)
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute(f"SET s3_endpoint='{endpoint}';")
con.execute(f"SET s3_access_key_id='{access_key}';")
con.execute(f"SET s3_secret_access_key='{secret_key}';")
con.execute(f"SET s3_url_style='path';")
con.execute(f"CREATE SCHEMA IF NOT EXISTS {dataset}")
con.execute(f"""
CREATE OR REPLACE VIEW {dataset}.{table} AS
SELECT * FROM read_parquet('{s3_path}', hive_partitioning=true, union_by_name=true)
""")
con.close()
log.info("[DUCKDB] View created: %s.%s", dataset, table)
return True, None
except Exception as e:
log.error("[DUCKDB] Failed to create view %s.%s: %s", dataset, table, e)
return False, str(e)
def run_sync(targets, args, env, dry_run=False, resume=False):
"""Run the sync for all target tables."""
s3_client = get_s3_client(env)
bq_client = get_bq_client()
# Load done/failed tracking
done_set = set()
if resume:
if os.path.exists(DONE_FILE):
with open(DONE_FILE) as f:
done_set = {l.strip() for l in f if l.strip()}
log.info("Resuming: %d tables already done", len(done_set))
failed_count = 0
done_count = 0
# Filter out already-done tables
targets = [(tid, info) for tid, info in targets if tid not in done_set]
if not targets:
log.info("No tables to sync.")
return 0, 0
log.info("Syncing %d tables...", len(targets))
for i, (table_id, info) in enumerate(targets, 1):
log.info("--- [%d/%d] Syncing %s ---", i, len(targets), table_id)
# Sync BQ → S3
ok, msg = sync_table(
(bq_client, s3_client, env["HETZNER_S3_BUCKET"]),
table_id,
info,
dry_run=dry_run,
)
log.info(msg)
if dry_run:
continue
if not ok:
with open(FAILED_FILE, "a") as f:
f.write(f"{table_id}\t{msg}\n")
failed_count += 1
continue
if "empty" in msg.lower():
continue
# Update DuckDB view
ok, err = update_duckdb_view(env, table_id, info)
if not ok:
with open(FAILED_FILE, "a") as f:
f.write(f"{table_id}\tDUCKDB: {err}\n")
# Mark done
with open(DONE_FILE, "a") as f:
f.write(f"{table_id}\n")
done_count += 1
return done_count, failed_count
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="Sync missing BQ tables to S3")
parser.add_argument("--dry-run", action="store_true", help="List tables without syncing")
parser.add_argument("--resume", action="store_true", help="Resume from last run")
args = parser.parse_args()
env = load_env()
dry_run = args.dry_run
if dry_run:
log.info("=== DRY RUN MODE ===")
# Step 1: List BigQuery tables
bq_client = get_bq_client()
bq_tables = list_bq_tables(bq_client)
# Step 2: List S3 tables
s3_client = get_s3_client(env)
s3_tables = list_s3_tables(s3_client, env["HETZNER_S3_BUCKET"])
# Step 3: Parse missing tables from MD
md_missing = parse_missing_tables_from_md(MISSING_TABLES_FILE)
# Step 4: Compute targets
targets = compute_missing_tables(bq_tables, s3_tables, md_missing)
if not targets:
log.info("No tables to sync.")
return
log.info("")
log.info("============================================")
log.info(" Tables to sync: %d", len(targets))
log.info("============================================")
for i, (table_id, info) in enumerate(targets, 1):
size_mb = estimate_size_mb(info.get("num_bytes"))
md_note = md_missing.get(table_id, "")
log.info(" [%d] %-50s %6s MB %s", i, table_id, size_mb, md_note)
log.info("")
if dry_run:
total_bytes = sum(info.get("num_bytes", 0) or 0 for _, info in targets)
total_gb = total_bytes / 1_073_741_824
log.info("Total estimated size: %.2f GB (BigQuery compressed bytes)", total_gb)
log.info("Run without --dry-run to start syncing.")
return
# Step 5: Run sync
log.info("Starting sync...")
done_count, failed_count = run_sync(targets, None, env, dry_run=False, resume=args.resume)
log.info("")
log.info("============================================")
log.info(" Sync complete!")
log.info(" Done: %d tables", done_count)
log.info(" Failed: %d tables", failed_count)
log.info(" Log: %s", LOG_FILE)
log.info("============================================")
if failed_count > 0:
log.info("Failed tables: see %s", FAILED_FILE)
sys.exit(1)
if __name__ == "__main__":
main()


@@ -143,11 +143,36 @@ Sources from https://github.com/jxnxts/mcp-brasil not in `basedosdados.duckdb`.
| INPE | `inpe` | none | `https://terrabrasilis.dpi.inpe.br/queimadas/bdqueimadas-data-service` | JSON |
| Tabua Mares | `tabua_mares` | none | `https://tabuademares.com/api/v2` | JSON |
-## Basedosdados.org — Not in basedosdados.duckdb (232 tables)
+## Basedosdados.org — Not in basedosdados.duckdb
-Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and thus in your duckdb). The following datasets have **zero or partial** tables in duckdb.
+Basedosdados.org has **765 tables** on BigQuery, **~533** on S3. The remaining gap:
-### Full datasets — no tables in duckdb
+- **2 TABLEs** need `bq extract` → GCS → S3 (waiting on GCP billing restore)
- **~230 are VIEWs** → need `bq query` to materialize, then `bq extract` (or streaming write to S3)
- **3 tables MISSING** from BQ entirely (br_bcb_sicor microdados_* don't exist)
### Need export — 2 TABLEs blocked on GCP billing
| Dataset | Table | BQ Type | Notes |
|---------|-------|---------|-------|
| `br_bcb_taxa_cambio` | taxa_cambio | TABLE | ✅ `bq extract` works |
| `br_bcb_taxa_selic` | taxa_selic | TABLE | ✅ `bq extract` works |
### Already on S3 (no action needed)
| Dataset | Tables |
|---------|--------|
| `br_bd_metadados` | bigquery_tables, prefect_flow_runs |
| `br_fbsp_absp` | uf, violencia_escola |
| `br_ibge_estadic` | dicionario |
| `br_camara_dados_abertos` | all 33 tables (222 parquet files) |
| `br_me_rais` | dicionario, microdados_estabelecimentos, microdados_vinculos |
### ~230 VIEWs — need bq query materialization pipeline
Cannot `bq extract` directly. Need to: (1) materialize via `bq query --destination_table`, or (2) stream via Python Arrow → S3 directly.
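For option (1), a minimal example of the materialization step, assuming a scratch dataset named `temp_export` in the billing project (project and table names here are placeholders, not the pipeline's actual configuration):

```sql
-- Placeholder names; the billed project must be your own, not `basedosdados`.
CREATE OR REPLACE TABLE `YOUR_PROJECT.temp_export.municipio` AS
SELECT * FROM `basedosdados.br_ana_atlas_esgotos.municipio`;
-- The materialized table can then be exported with `bq extract` as usual.
```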
#### Full datasets (all VIEWs)
| Dataset | Tables missing | Notes |
|---------|----------------|-------|
@@ -157,7 +182,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_anvisa_medicamentos_industrializados` | microdados | |
| `br_ba_feiradesantana_camara_leis` | microdados | |
| `br_bd_diretorios_data_tempo` | tempo, data, ano, mes, dia, hora, bimestre, trimestre, semestre, minuto, segundo | Directory of time dimensions |
-| `br_bd_metadados` | external_links, information_requests, organizations, prefect_flows, resources, tables | BD metadata catalog |
+| `br_bd_metadados` | external_links, information_requests, organizations, resources, tables | |
| `br_bd_vizinhanca` | municipio, uf | |
| `br_caixa_sorteios` | megasena | |
| `br_camara_dados_abertos` | sigla_partido | |
@@ -179,7 +204,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_ieps_saude` | brasil, macrorregiao, municipio, regiao_saude, uf | |
| `br_imprensa_nacional_dou` | secao_1, secao_2, secao_3 | Official gazette sections |
| `br_ipea_acesso_oportunidades` | estatisticas_2019, indicadores_2019 | |
-| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_municipio_de_para_anual/decenal/quinquenal, transicao_uf_de_para_anual/decenal/quinquenal | |
+| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_*(anual/decenal/quinquenal) | |
| `br_mc_indicadores` | transferencias_municipio | |
| `br_me_clima_organizacional` | microdados | |
| `br_me_estoque_divida_publica` | microdados | |
@@ -188,7 +213,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_me_siape` | servidores_executivo_federal | |
| `br_me_siorg` | remuneracao | |
| `br_mma_extincao` | fauna_ameacada, flora_ameacada | |
-| `br_mobilidados_indicadores` | 11 tables (comprometimento_renda_tarifa_transp_publico, proporcao_*, taxa_motorizacao, etc.) | |
+| `br_mobilidados_indicadores` | 11 tables | |
| `br_ms_atencao_basica` | municipio | |
| `br_ms_imunizacoes` | municipio | |
| `br_ons_energia_armazenada` | subsistemas | |
@@ -219,18 +244,16 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `world_ti_corruption_perception` | country | |
| `world_wb_wwbi` | country_finance, country_indicators | |
-### Partial datasets — some tables in duckdb, some missing
+#### Partial datasets — missing tables (all VIEWs, except where noted)
| Dataset | Missing tables | In duckdb |
|---------|----------------|-----------|
| `br_anatel_banda_larga_fixa` | backhaul, pble | densidade_*, microdados |
-| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, recurso_publico_* |
+| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, + 5 more TABLEs |
-| `br_bcb_taxa_cambio` | taxa_cambio | — (ACCESS_DENIED) |
-| `br_bcb_taxa_selic` | taxa_selic | — (ACCESS_DENIED) |
| `br_ibge_pib` | brasil_antigo, municipio_antigo, regiao_antigo, uf, uf_antigo | gini, municipio |
| `br_ibge_pnad_covid` | microdados | dicionario |
-| `br_ibge_pnadc` | ano_brasil_grupo_idade, ano_brasil_raca_cor, ano_municipio_*, ano_regiao_*, ano_uf_* (cross-tabs) | dicionario, educacao, microdados, rendimentos_outras_fontes |
+| `br_ibge_pnadc` | 10 cross-tab tables (ano_*) | dicionario, educacao, microdados, rendimentos_outras_fontes |
-| `br_ibge_pof` | all 17 tables (morador, domicilio, despesa_*, consumo_*, etc.) | none |
+| `br_ibge_pof` | all 17 tables (morador_*, domicilio_*, despesa_*, consumo_*, etc.) | none |
| `br_inep_ana` | aluno, escola, prova | dicionario |
| `br_inep_censo_escolar` | docente, matricula | dicionario, escola, turma |
| `br_inep_formacao_docente` | brasil, escola, municipio, regiao, uf | dicionario |
@@ -238,8 +261,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_inep_indicadores_educacionais` | escola_nivel_socioeconomico, fluxo_educacao_superior | all others |
| `br_inmet_bdmep` | estacao | microdados |
| `br_me_caged` | microdados_antigos, microdados_antigos_ajustes | dicionario, microdados_movimentacao* |
-| `br_me_cno` | microdados, microdados_cnae, microdados_vinculo | dicionario, microdados |
+| `br_me_cno` | microdados_cnae, microdados_vinculo | dicionario, microdados |
-| `br_me_rais` | all tables | dicionario, microdados_estabelecimentos, microdados_vinculos |
| `br_mec_prouni` | microdados | dicionario |
| `br_ms_sim` | municipio, municipio_causa, municipio_causa_idade, municipio_causa_idade_sexo_raca | dicionario, microdados |
| `br_ms_sinan` | microdados_violencia | dicionario, microdados_dengue, microdados_influenza_srag |
@@ -247,3 +269,13 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
| `br_seeg_emissoes` | brasil | dicionario, municipio, uf |
| `br_tse_eleicoes` | local_secao | all others |
| `world_oecd_pisa` | dictionary, school_summary, student_summary | student |
### Tables that don't exist in BigQuery (3)
These were listed in datasets_to_scrap but actually don't exist in `basedosdados`:
| Dataset | Table |
|---------|-------|
| `br_bcb_sicor` | microdados_liberacao |
| `br_bcb_sicor` | microdados_operacao |
| `br_bcb_sicor` | microdados_saldo |

tasks/missing_tables.txt Normal file (+270 lines)

@@ -0,0 +1,270 @@
br_abrinq_oca.municipio_primeira_infancia
br_ana_atlas_esgotos.municipio
br_ana_reservatorios.sin
br_anvisa_medicamentos_industrializados.microdados
br_ba_feiradesantana_camara_leis.microdados
br_bd_diretorios_data_tempo.ano
br_bd_diretorios_data_tempo.bimestre
br_bd_diretorios_data_tempo.data
br_bd_diretorios_data_tempo.dia
br_bd_diretorios_data_tempo.hora
br_bd_diretorios_data_tempo.mes
br_bd_diretorios_data_tempo.minuto
br_bd_diretorios_data_tempo.segundo
br_bd_diretorios_data_tempo.semestre
br_bd_diretorios_data_tempo.tempo
br_bd_diretorios_data_tempo.trimestre
br_bd_metadados.bigquery_tables
br_bd_metadados.external_links
br_bd_metadados.information_requests
br_bd_metadados.organizations
br_bd_metadados.prefect_flow_runs
br_bd_metadados.resources
br_bd_metadados.tables
br_bd_vizinhanca.municipio
br_bd_vizinhanca.uf
br_caixa_sorteios.megasena
br_camara_dados_abertos.deputado
br_camara_dados_abertos.deputado_ocupacao
br_camara_dados_abertos.deputado_profissao
br_camara_dados_abertos.despesa
br_camara_dados_abertos.evento
br_camara_dados_abertos.evento_orgao
br_camara_dados_abertos.evento_presenca_deputado
br_camara_dados_abertos.evento_requerimento
br_camara_dados_abertos.frente
br_camara_dados_abertos.frente_deputado
br_camara_dados_abertos.funcionario
br_camara_dados_abertos.legislatura
br_camara_dados_abertos.legislatura_mesa
br_camara_dados_abertos.licitacao
br_camara_dados_abertos.licitacao_contrato
br_camara_dados_abertos.licitacao_item
br_camara_dados_abertos.licitacao_pedido
br_camara_dados_abertos.licitacao_proposta
br_camara_dados_abertos.orgao
br_camara_dados_abertos.orgao_deputado
br_camara_dados_abertos.proposicao_autor
br_camara_dados_abertos.proposicao_microdados
br_camara_dados_abertos.proposicao_tema
br_camara_dados_abertos.sigla_partido
br_camara_dados_abertos.votacao
br_camara_dados_abertos.votacao_objeto
br_camara_dados_abertos.votacao_orientacao_bancada
br_camara_dados_abertos.votacao_parlamentar
br_camara_dados_abertos.votacao_proposicao
br_capes_bolsas.mobilidade_internacional
br_cgu_ebt.municipio
br_cgu_ebt.uf
br_cgu_fef.microdados
br_cgu_fef.municipios_sorteados
br_cgu_fef.sorteio
br_cgu_pessoal_executivo_federal.terceirizados
br_clp_ranking_competitividade.nota_geral_municipio
br_clp_ranking_competitividade.nota_geral_uf
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
br_fbsp_absp.municipio
br_fbsp_absp.uf
br_fbsp_absp.violencia_escola
br_firjan_ifgf.ranking
br_ggb_relatorio_lgbtqi.brasil
br_ggb_relatorio_lgbtqi.causa_obito
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
br_ggb_relatorio_lgbtqi.local
br_ggb_relatorio_lgbtqi.raca_cor
br_ibge_amc.municipio_de_para
br_ibge_cbo_2002.perfil_ocupacional
br_ibge_cbo_2002.sinonimo
br_ibge_estadic.comunicacao_informatica
br_ibge_estadic.dicionario
br_ibge_estadic.educacao
br_ibge_estadic.governanca
br_ibge_estadic.indicadores_perfil_gestor
br_ibge_estadic.indicadores_quantidade_vinculo
br_ibge_estadic.politica_mulher
br_ibge_estadic.recursos_humanos
br_ibge_ipp.mes_categoria_economica
br_ibge_ipp.mes_grupo_industrial
br_ibge_ipp.mes_industria_atividade
br_ibge_ipp.mes_industria_extrativa
br_ibge_ipp.mes_industria_geral
br_ibge_ipp.mes_industria_transformacao
br_ibge_munic.indicadores_perfil_gestor
br_ibge_munic.indicadores_quantidade_vinculo
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
br_ieps_saude.brasil
br_ieps_saude.macrorregiao
br_ieps_saude.municipio
br_ieps_saude.regiao_saude
br_ieps_saude.uf
br_imprensa_nacional_dou.secao_1
br_imprensa_nacional_dou.secao_2
br_imprensa_nacional_dou.secao_3
br_ipea_acesso_oportunidades.estatisticas_2019
br_ipea_acesso_oportunidades.indicadores_2019
br_mapbiomas_estatisticas.classe
br_mapbiomas_estatisticas.cobertura_municipio_classe
br_mapbiomas_estatisticas.cobertura_uf_classe
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
br_mc_indicadores.transferencias_municipio
br_me_clima_organizacional.microdados
br_me_estoque_divida_publica.microdados
br_me_exportadoras_importadoras.dicionario
br_me_exportadoras_importadoras.estabelecimentos
br_me_pensionistas.microdados
br_me_siape.servidores_executivo_federal
br_me_siorg.remuneracao
br_mma_extincao.fauna_ameacada
br_mma_extincao.flora_ameacada
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
br_mobilidados_indicadores.divisao_modal
br_mobilidados_indicadores.emissao_co2_material_particulado
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
br_mobilidados_indicadores.taxa_motorizacao
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
br_mobilidados_indicadores.transporte_media_alta_capacidade
br_ms_atencao_basica.municipio
br_ms_imunizacoes.municipio
br_ons_energia_armazenada.subsistemas
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
br_rj_rio_de_janeiro_ipp_ips.indicadores
br_rj_tce_iegm.indicadores
br_senado_cpipandemia.discursos
br_sgp_informacao.despesas_cartao_corporativo
br_sp_alesp.assessores_lideranca
br_sp_alesp.assessores_parlamentares
br_sp_alesp.deputado
br_sp_alesp.despesas_gabinete
br_sp_alesp.despesas_gabinete_atual
br_sp_gov_orcamento.despesa
br_sp_gov_orcamento.receita_arrecadada
br_sp_gov_orcamento.receita_prevista
br_sp_gov_ssp.ocorrencias_registradas
br_sp_gov_ssp.produtividade_policial
br_sp_saopaulo_dieese_icv.ano
br_sp_seduc_fluxo_escolar.escola
br_sp_seduc_fluxo_escolar.municipio
br_sp_seduc_idesp.diretoria
br_sp_seduc_idesp.escola
br_sp_seduc_idesp.uf
br_sp_seduc_inse.escola
br_tpe_classificacao_saeb.categoria
eu_fra_lgbt.consciencia_direitos
eu_fra_lgbt.cotidiano
eu_fra_lgbt.discriminacao
eu_fra_lgbt.especifico_transgenero
eu_fra_lgbt.violencia_abuso
mundo_bm_learning_poverty.pais
mundo_kaggle_olimpiadas.microdados
mundo_onu_adh.brasil
mundo_onu_adh.municipio
mundo_onu_adh.uf
mundo_transrespect_transphobia.causa_obito
mundo_transrespect_transphobia.local
mundo_transrespect_transphobia.pais
nl_ug_pwt.microdados
world_fao_production.country_group
world_fao_production.crop_livestock
world_fao_production.dictionary
world_fao_production.element
world_fao_production.item
world_fao_production.item_group
world_fao_production.production_indices
world_fao_production.value_agricultural_production
world_fifa_women_world_cup.matches
world_fifa_worldcup.award_winners
world_fifa_worldcup.matches
world_fifa_worldcup.players
world_fifa_worldcup.teams
world_fifa_worldcup.tournaments
world_gsps_consortium_gsps.global_indicators
world_slave_voyages_consortium_slave_trade.transatlantic
world_spi_spi.global_indicators
world_ti_corruption_perception.country
world_wb_wwbi.country_finance
world_wb_wwbi.country_indicators
br_anatel_banda_larga_fixa.backhaul
br_anatel_banda_larga_fixa.pble
br_bcb_sicor.microdados_liberacao
br_bcb_sicor.microdados_operacao
br_bcb_sicor.microdados_saldo
br_bcb_taxa_cambio.taxa_cambio
br_bcb_taxa_selic.taxa_selic
br_ibge_pib.brasil_antigo
br_ibge_pib.municipio_antigo
br_ibge_pib.regiao_antigo
br_ibge_pib.uf
br_ibge_pib.uf_antigo
br_ibge_pnad_covid.microdados
br_ibge_pnadc.ano_brasil_grupo_idade
br_ibge_pnadc.ano_brasil_raca_cor
br_ibge_pnadc.ano_municipio_grupo_idade
br_ibge_pnadc.ano_municipio_raca_cor
br_ibge_pnadc.ano_regiao_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
br_ibge_pnadc.ano_regiao_raca_cor
br_ibge_pnadc.ano_uf_grupo_idade
br_ibge_pnadc.ano_uf_raca_cor
br_ibge_pof.aluguel_estimado_2017
br_ibge_pof.cadastro_de_produtos_2017
br_ibge_pof.caderneta_coletiva_2017
br_ibge_pof.caracteristicas_dieta_2017
br_ibge_pof.condicoes_vida_2017
br_ibge_pof.consumo_alimentar_2017
br_ibge_pof.despesa_coletiva_2017
br_ibge_pof.despesa_individual_2017
br_ibge_pof.domicilio_2017
br_ibge_pof.inventario_2017
br_ibge_pof.morador_2017
br_ibge_pof.outros_rendimentos_2017
br_ibge_pof.rendimento_trabalho_2017
br_ibge_pof.restricao_saude_2017
br_ibge_pof.servico_nao_monetario_pof2_2017
br_ibge_pof.servico_nao_monetario_pof4_2017
br_inep_ana.aluno
br_inep_ana.escola
br_inep_ana.prova
br_inep_censo_escolar.docente
br_inep_censo_escolar.matricula
br_inep_formacao_docente.brasil
br_inep_formacao_docente.escola
br_inep_formacao_docente.municipio
br_inep_formacao_docente.regiao
br_inep_formacao_docente.uf
br_inep_indicador_nivel_socioeconomico.brasil
br_inep_indicador_nivel_socioeconomico.municipio
br_inep_indicador_nivel_socioeconomico.uf
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
br_inep_indicadores_educacionais.fluxo_educacao_superior
br_inmet_bdmep.estacao
br_me_caged.microdados_antigos
br_me_caged.microdados_antigos_ajustes
br_me_cno.microdados_cnae
br_me_cno.microdados_vinculo
br_me_rais.dicionario
br_me_rais.microdados_estabelecimentos
br_me_rais.microdados_vinculos
br_mec_prouni.microdados
br_ms_sim.municipio
br_ms_sim.municipio_causa
br_ms_sim.municipio_causa_idade
br_ms_sim.municipio_causa_idade_sexo_raca
br_ms_sinan.microdados_violencia
br_ms_vacinacao_covid19.microdados
br_ms_vacinacao_covid19.microdados_estabelecimento
br_ms_vacinacao_covid19.microdados_paciente
br_ms_vacinacao_covid19.microdados_vacinacao
br_seeg_emissoes.brasil
br_tse_eleicoes.local_secao
world_oecd_pisa.dictionary
world_oecd_pisa.school_summary
world_oecd_pisa.student_summary

2
tasks/pending_tables.txt Normal file
View File

@@ -0,0 +1,2 @@
br_bcb_taxa_cambio.taxa_cambio
br_bcb_taxa_selic.taxa_selic

View File

@@ -0,0 +1,229 @@
br_abrinq_oca.municipio_primeira_infancia
br_ana_atlas_esgotos.municipio
br_ana_reservatorios.sin
br_anvisa_medicamentos_industrializados.microdados
br_ba_feiradesantana_camara_leis.microdados
br_bd_diretorios_data_tempo.ano
br_bd_diretorios_data_tempo.bimestre
br_bd_diretorios_data_tempo.data
br_bd_diretorios_data_tempo.dia
br_bd_diretorios_data_tempo.hora
br_bd_diretorios_data_tempo.mes
br_bd_diretorios_data_tempo.minuto
br_bd_diretorios_data_tempo.segundo
br_bd_diretorios_data_tempo.semestre
br_bd_diretorios_data_tempo.tempo
br_bd_diretorios_data_tempo.trimestre
br_bd_metadados.external_links
br_bd_metadados.information_requests
br_bd_metadados.organizations
br_bd_metadados.resources
br_bd_metadados.tables
br_bd_vizinhanca.municipio
br_bd_vizinhanca.uf
br_caixa_sorteios.megasena
br_camara_dados_abertos.sigla_partido
br_capes_bolsas.mobilidade_internacional
br_cgu_ebt.municipio
br_cgu_ebt.uf
br_cgu_fef.microdados
br_cgu_fef.municipios_sorteados
br_cgu_fef.sorteio
br_cgu_pessoal_executivo_federal.terceirizados
br_clp_ranking_competitividade.nota_geral_municipio
br_clp_ranking_competitividade.nota_geral_uf
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
br_fbsp_absp.municipio
br_firjan_ifgf.ranking
br_ggb_relatorio_lgbtqi.brasil
br_ggb_relatorio_lgbtqi.causa_obito
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
br_ggb_relatorio_lgbtqi.local
br_ggb_relatorio_lgbtqi.raca_cor
br_ibge_amc.municipio_de_para
br_ibge_cbo_2002.perfil_ocupacional
br_ibge_cbo_2002.sinonimo
br_ibge_estadic.comunicacao_informatica
br_ibge_estadic.educacao
br_ibge_estadic.governanca
br_ibge_estadic.indicadores_perfil_gestor
br_ibge_estadic.indicadores_quantidade_vinculo
br_ibge_estadic.politica_mulher
br_ibge_estadic.recursos_humanos
br_ibge_ipp.mes_categoria_economica
br_ibge_ipp.mes_grupo_industrial
br_ibge_ipp.mes_industria_atividade
br_ibge_ipp.mes_industria_extrativa
br_ibge_ipp.mes_industria_geral
br_ibge_ipp.mes_industria_transformacao
br_ibge_munic.indicadores_perfil_gestor
br_ibge_munic.indicadores_quantidade_vinculo
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
br_ibge_pib.brasil_antigo
br_ibge_pib.municipio_antigo
br_ibge_pib.regiao_antigo
br_ibge_pib.uf
br_ibge_pib.uf_antigo
br_ibge_pnad_covid.microdados
br_ibge_pnadc.ano_brasil_grupo_idade
br_ibge_pnadc.ano_brasil_raca_cor
br_ibge_pnadc.ano_municipio_grupo_idade
br_ibge_pnadc.ano_municipio_raca_cor
br_ibge_pnadc.ano_regiao_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
br_ibge_pnadc.ano_regiao_raca_cor
br_ibge_pnadc.ano_uf_grupo_idade
br_ibge_pnadc.ano_uf_raca_cor
br_ibge_pof.aluguel_estimado_2017
br_ibge_pof.cadastro_de_produtos_2017
br_ibge_pof.caderneta_coletiva_2017
br_ibge_pof.caracteristicas_dieta_2017
br_ibge_pof.condicoes_vida_2017
br_ibge_pof.consumo_alimentar_2017
br_ibge_pof.despesa_coletiva_2017
br_ibge_pof.despesa_individual_2017
br_ibge_pof.domicilio_2017
br_ibge_pof.inventario_2017
br_ibge_pof.morador_2017
br_ibge_pof.outros_rendimentos_2017
br_ibge_pof.rendimento_trabalho_2017
br_ibge_pof.restricao_saude_2017
br_ibge_pof.servico_nao_monetario_pof2_2017
br_ibge_pof.servico_nao_monetario_pof4_2017
br_ieps_saude.brasil
br_ieps_saude.macrorregiao
br_ieps_saude.municipio
br_ieps_saude.regiao_saude
br_ieps_saude.uf
br_imprensa_nacional_dou.secao_1
br_imprensa_nacional_dou.secao_2
br_imprensa_nacional_dou.secao_3
br_ipea_acesso_oportunidades.estatisticas_2019
br_ipea_acesso_oportunidades.indicadores_2019
br_mapbiomas_estatisticas.classe
br_mapbiomas_estatisticas.cobertura_municipio_classe
br_mapbiomas_estatisticas.cobertura_uf_classe
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
br_mc_indicadores.transferencias_municipio
br_me_caged.microdados_antigos
br_me_caged.microdados_antigos_ajustes
br_me_clima_organizacional.microdados
br_me_cno.microdados_cnae
br_me_cno.microdados_vinculo
br_me_estoque_divida_publica.microdados
br_me_exportadoras_importadoras.dicionario
br_me_exportadoras_importadoras.estabelecimentos
br_me_pensionistas.microdados
br_me_siape.servidores_executivo_federal
br_me_siorg.remuneracao
br_mec_prouni.microdados
br_mma_extincao.fauna_ameacada
br_mma_extincao.flora_ameacada
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
br_mobilidados_indicadores.divisao_modal
br_mobilidados_indicadores.emissao_co2_material_particulado
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
br_mobilidados_indicadores.taxa_motorizacao
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
br_mobilidados_indicadores.transporte_media_alta_capacidade
br_ms_atencao_basica.municipio
br_ms_imunizacoes.municipio
br_ms_sim.municipio
br_ms_sim.municipio_causa
br_ms_sim.municipio_causa_idade
br_ms_sim.municipio_causa_idade_sexo_raca
br_ms_sinan.microdados_violencia
br_ms_vacinacao_covid19.microdados
br_ms_vacinacao_covid19.microdados_estabelecimento
br_ms_vacinacao_covid19.microdados_paciente
br_ms_vacinacao_covid19.microdados_vacinacao
br_ons_energia_armazenada.subsistemas
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
br_rj_rio_de_janeiro_ipp_ips.indicadores
br_rj_tce_iegm.indicadores
br_seeg_emissoes.brasil
br_senado_cpipandemia.discursos
br_sgp_informacao.despesas_cartao_corporativo
br_sp_alesp.assessores_lideranca
br_sp_alesp.assessores_parlamentares
br_sp_alesp.deputado
br_sp_alesp.despesas_gabinete
br_sp_alesp.despesas_gabinete_atual
br_sp_gov_orcamento.despesa
br_sp_gov_orcamento.receita_arrecadada
br_sp_gov_orcamento.receita_prevista
br_sp_gov_ssp.ocorrencias_registradas
br_sp_gov_ssp.produtividade_policial
br_sp_saopaulo_dieese_icv.ano
br_sp_seduc_fluxo_escolar.escola
br_sp_seduc_fluxo_escolar.municipio
br_sp_seduc_idesp.diretoria
br_sp_seduc_idesp.escola
br_sp_seduc_idesp.uf
br_sp_seduc_inse.escola
br_tpe_classificacao_saeb.categoria
br_tse_eleicoes.local_secao
eu_fra_lgbt.consciencia_direitos
eu_fra_lgbt.cotidiano
eu_fra_lgbt.discriminacao
eu_fra_lgbt.especifico_transgenero
eu_fra_lgbt.violencia_abuso
mundo_bm_learning_poverty.pais
mundo_kaggle_olimpiadas.microdados
mundo_onu_adh.brasil
mundo_onu_adh.municipio
mundo_onu_adh.uf
mundo_transrespect_transphobia.causa_obito
mundo_transrespect_transphobia.local
mundo_transrespect_transphobia.pais
nl_ug_pwt.microdados
world_fao_production.country_group
world_fao_production.crop_livestock
world_fao_production.dictionary
world_fao_production.element
world_fao_production.item
world_fao_production.item_group
world_fao_production.production_indices
world_fao_production.value_agricultural_production
world_fifa_women_world_cup.matches
world_fifa_worldcup.award_winners
world_fifa_worldcup.matches
world_fifa_worldcup.players
world_fifa_worldcup.teams
world_fifa_worldcup.tournaments
world_gsps_consortium_gsps.global_indicators
world_oecd_pisa.dictionary
world_oecd_pisa.school_summary
world_oecd_pisa.student_summary
world_slave_voyages_consortium_slave_trade.transatlantic
world_spi_spi.global_indicators
world_ti_corruption_perception.country
world_wb_wwbi.country_finance
world_wb_wwbi.country_indicators
br_anatel_banda_larga_fixa.backhaul
br_anatel_banda_larga_fixa.pble
br_inep_ana.aluno
br_inep_ana.escola
br_inep_ana.prova
br_inep_censo_escolar.docente
br_inep_censo_escolar.matricula
br_inep_formacao_docente.brasil
br_inep_formacao_docente.escola
br_inep_formacao_docente.municipio
br_inep_formacao_docente.regiao
br_inep_formacao_docente.uf
br_inep_indicador_nivel_socioeconomico.brasil
br_inep_indicador_nivel_socioeconomico.municipio
br_inep_indicador_nivel_socioeconomico.uf
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
br_inep_indicadores_educacionais.fluxo_educacao_superior
br_inmet_bdmep.estacao