refactor: reorganize project structure and fix broken references
- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
.gitignore | 3 (vendored)
@@ -3,6 +3,5 @@
logs/
done_tables.txt
done_transfers.txt
# CocoIndex Code (ccc)
/.cocoindex_code/
**/target
*.log

Dockerfile

@@ -28,8 +28,9 @@ ENV PATH="/root/.cargo/bin:${PATH}" \

WORKDIR /app

COPY basedosdados.duckdb Caddyfile start.sh auth.py ask.py ./
RUN chmod +x start.sh
COPY data/basedosdados.duckdb shell/Caddyfile shell/auth.py start.sh ./
COPY ask/ask /app/ask
RUN chmod +x start.sh /app/ask

EXPOSE 8080
README.md | 127
@@ -8,11 +8,13 @@ Os dados foram exportados do BigQuery para o Hetzner Object Storage (Helsinki) n

## Consultando os dados

Acesso via browser ou curl, protegido por senha. Peça a senha para o administrador.
Acesso via browser ou curl, protegido por senha - peça!

### Shell no browser

Acesse **https://db.xn--2dk.xyz** → autentique → shell DuckDB interativo direto no browser.
Acesse **https://db.ミ.xyz** → autentique → shell DuckDB interativo direto no browser.

Use `.tables` para listar os datasets.

### SQL via curl

@@ -46,35 +48,6 @@ curl -s -X POST https://db.xn--2dk.xyz/query \
  --data-binary @query.sql > resultado.csv
```

### Descobrindo tabelas

```sql
-- listar todos os datasets (schemas)
SHOW SCHEMAS;

-- listar tabelas de um dataset
SHOW TABLES IN br_anatel_banda_larga_fixa;

-- ver colunas de uma tabela
DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
```

No shell do browser, `.tables` lista tudo de uma vez.

### Exportar em CSV ou JSON

O DuckDB permite formatar a saída diretamente na query:

```sql
-- CSV com header (pipe para arquivo via curl)
COPY (SELECT * FROM br_ibge_censo2022.municipios LIMIT 1000)
TO '/dev/stdout' (FORMAT csv, HEADER true);

-- JSON
SELECT * FROM br_ibge_censo2022.municipios LIMIT 10
FORMAT JSON;
```

---

## Exploração local

@@ -82,11 +55,11 @@ FORMAT JSON;

Para rodar as queries na sua própria máquina com DuckDB instalado:

```bash
python prepara_db.py  # gera basedosdados.duckdb com views apontando para o S3
duckdb basedosdados.duckdb
duckdb data/basedosdados.duckdb
```

As queries são executadas diretamente sobre os arquivos Parquet no S3 — não há download de dados. O DuckDB lê os arquivos remotos sob demanda via `httpfs`.
Precisa da credencial da .env - peça!

---

@@ -94,62 +67,52 @@ As queries são executadas diretamente sobre os arquivos Parquet no S3 — não
Interface TUI que permite fazer perguntas em português e obter SQL automaticamente.

### Arquitetura

```
Pergunta → [schema filtrado] → LLM local (sqlcoder) ou API externa
         → SQL
```

1. **Schema filtrado**: As tabelas relevantes são filtradas e enviadas ao LLM
2. **Geração SQL**: Modelo local (sqlcoder via Ollama) ou API externa (Gemini/OpenRouter)

### No browser

Acesse **https://ask.xn--2dk.xyz** → autentique → digite sua pergunta em português.
Acesse **https://ask.ミ.xyz** → autentique → digite sua pergunta em português.

### Local

```bash
# Compilar
cd ask
cargo build --release
./target/release/ask                                  # modo interativo
./target/release/ask "Quantos municípios tem SP?"     # modo CLI

# Modo interativo (TUI)
./target/release/ask

# Modo CLI
./target/release/ask "Quantos municípios tem SP?"
```

### Variáveis de ambiente

| Variável | Descrição |
|---|---|
| `GEMINI_API_KEY` | Chave da API Gemini (obrigatória para usar modelos Gemini) |
| `OPENROUTER_API_KEY` | Chave para usar modelos via OpenRouter |
| `GEMINI_MODEL` | Modelo a usar (padrão: `gemini-flash-latest`) |
| `SCHEMA_FILE` | Arquivo de schema (padrão: `context/schema_compact_inline.txt`) |
| `DB_FILE` | Arquivo DuckDB (padrão: `basedosdados.duckdb`) |

| Variável | Padrão | Descrição |
|---|---|---|
| `SQL_GENERATOR` | `gemini` | Generator: `sqlcoder`, `gemini`, ou `openrouter` |
| `GEMINI_API_KEY` | — | Chave API Gemini (obrigatória se usar gemini) |
| `OPENROUTER_API_KEY` | — | Chave API OpenRouter (obrigatória se usar openrouter) |
| `GEMINI_MODEL` | `gemini-flash-latest` | Modelo Gemini |
| `OPENROUTER_MODEL` | `openai/gpt-4o-mini` | Modelo OpenRouter |
| `OLLAMA_MODEL` | `sqlcoder` | Modelo Ollama (sqlcoder ou sqlcoder:14b) |
| `OLLAMA_HOST` | `http://localhost:11434` | Host Ollama |
| `TOP_K_TABLES` | `5` | Número de tabelas a selecionar |
| `SCHEMA_FILE` | `context/schema_compact_inline.txt` | Schema texto para fallback |
| `SCHEMA_JSON` | `context/basedosdados-schema.json` | Schema JSON completo |
| `DB_FILE` | `data/basedosdados.duckdb` | Arquivo DuckDB |

---
## Arquivos de schema

O diretório `context/` contém artefatos gerados automaticamente para contexto do LLM e descoberta de tabelas:

| Arquivo | Descrição |
|---|---|
| `schema_compact_inline.txt` | Schema condensado para contexto do LLM |
| `schema_compact.txt` | Schema mais verboso |
| `schema_ddl.sql` | DDL das views DuckDB |
| `join_graph.json` | Relacionamentos entre tabelas |
| `file_tree.md` | Estrutura de arquivos no S3 com tamanhos |
| `schemas.json` | Schema raw do BigQuery |

---

## Descobrindo tabelas

```sql
-- listar todos os datasets (schemas)
SHOW SCHEMAS;

-- listar tabelas de um dataset
SHOW TABLES IN br_anatel_banda_larga_fixa;

-- ver colunas de uma tabela
DESCRIBE br_anatel_banda_larga_fixa.densidade_brasil;
```

No shell do browser, `.tables` lista tudo de uma vez. Para descoberta programática, use os arquivos em `context/`.

---

## Pipeline de exportação

@@ -172,8 +135,8 @@ Resume automático: se interrompido, basta rodar novamente.

| Script | Função |
|---|---|
| `roda.sh` | Pipeline principal de exportação |
| `prepara_db.py` | Gera `basedosdados.duckdb` com views para todas as tabelas |
| `scripts/roda.sh` | Pipeline principal de exportação |
| `scripts/prepara_db.py` | Gera `data/basedosdados.duckdb` com views para todas as tabelas |

### Configuração (`.env`)

@@ -196,10 +159,10 @@ Resume automático: se interrompido, basta rodar novamente.

### Executando

```bash
chmod +x roda.sh
./roda.sh --dry-run      # estima tamanho e custo
./roda.sh                # execução local
./roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
chmod +x scripts/roda.sh
./scripts/roda.sh --dry-run      # estima tamanho e custo
./scripts/roda.sh                # execução local
./scripts/roda.sh --gcloud-run   # cria VM no GCP, roda lá e deleta ao final
```

Autenticação GCP necessária antes da primeira exportação:

@@ -219,8 +182,8 @@ Cria uma VM `e2-standard-4` Debian 12 em `us-central1-a`, copia o script e o `.e

| `GCP_VM_NAME` | `bd-export-vm` | Nome da instância |
| `GCP_VM_ZONE` | `us-central1-a` | Zona do Compute Engine |

### Deploy do servidor
### Deploy do servidor para serviços de db e ask

```bash
haloy deploy
haloy deploy -f shell/haloy.yml
```
ask.py | 129
@@ -1,129 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
ask.py — Send a Portuguese question to Gemini and get back SQL.
|
||||
|
||||
Usage:
|
||||
python ask.py "Quantos pedidos foram feitos por cliente no último mês?"
|
||||
python ask.py "Qual a taxa de mortalidade infantil por município em 2020?"
|
||||
|
||||
Env vars:
|
||||
GEMINI_API_KEY — required
|
||||
SCHEMA_FILE — path to DDL file (default: context/schema_compact_inline.txt)
|
||||
GEMINI_MODEL — model slug (default: gemini-2.0-flash-latest)
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import requests
|
||||
import duckdb
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
SCHEMA_FILE = os.getenv("SCHEMA_FILE", "context/schema_compact_inline.txt")
|
||||
MODEL = os.getenv("GEMINI_MODEL", "gemini-flash-latest")
|
||||
DB_FILE = os.getenv("DB_FILE", "basedosdados.duckdb")
|
||||
|
||||
|
||||
def load_schema(path: str) -> str:
|
||||
with open(path, "r", encoding="utf-8") as f:
|
||||
return f.read()
|
||||
|
||||
|
||||
def ask(question: str) -> str:
|
||||
api_key = os.getenv("GEMINI_API_KEY")
|
||||
if not api_key:
|
||||
sys.exit("Error: GEMINI_API_KEY not set")
|
||||
|
||||
schema_ddl = load_schema(SCHEMA_FILE)
|
||||
|
||||
system_prompt = (
|
||||
"You are a SQL expert for Base dos Dados (basedosdados.org), "
|
||||
"a Brazilian open data warehouse with tables accessed via DuckDB.\n\n"
|
||||
"Rules:\n"
|
||||
"- Use DuckDB syntax. Tables are referenced as dataset.table.\n"
|
||||
"- Only use columns from the provided DDL — never invent column names.\n"
|
||||
"- Add WHERE filters on ano, sigla_uf, or id_municipio whenever possible.\n"
|
||||
"- Return ONLY the SQL query, no explanation, no markdown fences.\n\n"
|
||||
f"Schema DDL:\n\n{schema_ddl}"
|
||||
)
|
||||
|
||||
url = (
|
||||
f"https://generativelanguage.googleapis.com/v1beta/models"
|
||||
f"/{MODEL}:generateContent"
|
||||
)
|
||||
|
||||
payload = {
|
||||
"system_instruction": {
|
||||
"parts": [{"text": system_prompt}]
|
||||
},
|
||||
"contents": [
|
||||
{
|
||||
"parts": [{"text": question}]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
response = requests.post(
|
||||
url,
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"X-goog-api-key": api_key,
|
||||
},
|
||||
data=json.dumps(payload),
|
||||
timeout=300,
|
||||
)
|
||||
|
||||
response.raise_for_status()
|
||||
result = response.json()
|
||||
|
||||
return result["candidates"][0]["content"]["parts"][0]["text"].strip()
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print(f"Usage: python {sys.argv[0]} \"<pergunta em português>\"", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
question = " ".join(sys.argv[1:])
|
||||
print(f"Question: {question}\n", file=sys.stderr)
|
||||
print(f"Model: {MODEL}\n", file=sys.stderr)
|
||||
|
||||
sql = ask(question)
|
||||
|
||||
print(f"\n── SQL ──────────────────────────────────────────\n{sql}\n", file=sys.stderr)
|
||||
|
||||
con = duckdb.connect(DB_FILE, read_only=True)
|
||||
rel = con.sql(sql)
|
||||
|
||||
# box mode: build borders from column names + data
|
||||
cols = rel.columns
|
||||
rows = rel.fetchall()
|
||||
|
||||
if not rows:
|
||||
print("(no rows returned)")
|
||||
return
|
||||
|
||||
col_widths = [len(c) for c in cols]
|
||||
for row in rows:
|
||||
for i, val in enumerate(row):
|
||||
col_widths[i] = max(col_widths[i], len(str(val) if val is not None else "NULL"))
|
||||
|
||||
def bar(left, mid, right, fill="─"):
|
||||
return left + mid.join(fill * (w + 2) for w in col_widths) + right
|
||||
|
||||
header = "│" + "│".join(f" {c:{w}} " for c, w in zip(cols, col_widths)) + "│"
|
||||
|
||||
print(bar("┌", "┬", "┐"))
|
||||
print(header)
|
||||
print(bar("├", "┼", "┤"))
|
||||
for row in rows:
|
||||
vals = [str(v) if v is not None else "NULL" for v in row]
|
||||
print("│" + "│".join(f" {v:{w}} " for v, w in zip(vals, col_widths)) + "│")
|
||||
print(bar("└", "┴", "┘"))
|
||||
print(f"\n{len(rows)} row(s)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
ask/.dockerignore | 1 (new file)

@@ -0,0 +1 @@
target
ask/Cargo.lock | 1 (generated)

@@ -252,6 +252,7 @@ dependencies = [
 "duckdb",
 "ratatui",
 "reqwest",
 "serde",
 "serde_json",
 "syntect",
 "tui-textarea",

ask/Cargo.toml

@@ -9,6 +9,7 @@ path = "src/main.rs"

[dependencies]
reqwest = { version = "0.12", features = ["blocking", "rustls-tls", "json"], default-features = false }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
duckdb = { version = "1", features = ["bundled"] }
dotenvy = "0.15"
ask/src/main.rs | 227
@@ -1,4 +1,9 @@
|
||||
mod schema_filter;
|
||||
mod sql_generator;
|
||||
mod table_selector;
|
||||
|
||||
use anyhow::{Context, Result};
|
||||
use chrono::Utc;
|
||||
use crossterm::{
|
||||
event::{
|
||||
DisableBracketedPaste, DisableMouseCapture, EnableBracketedPaste, EnableMouseCapture,
|
||||
@@ -9,14 +14,12 @@ use crossterm::{
|
||||
};
|
||||
use duckdb::Connection;
|
||||
use ratatui::{
|
||||
buffer::Buffer,
|
||||
layout::{Constraint, Direction, Layout, Rect},
|
||||
style::{Color, Modifier, Style},
|
||||
text::{Line, Span},
|
||||
widgets::{Block, Borders, Gauge, Paragraph, Row, Table, TableState, Wrap},
|
||||
Frame, Terminal,
|
||||
};
|
||||
use chrono::Utc;
|
||||
use serde_json::{json, Value};
|
||||
use std::{
|
||||
env, fs,
|
||||
@@ -43,6 +46,10 @@ struct Config {
|
||||
schema: String,
|
||||
db_file: String,
|
||||
prompt_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
}
|
||||
|
||||
enum Phase {
|
||||
@@ -234,10 +241,23 @@ fn spawn_worker(
|
||||
model: String,
|
||||
prompt_file: String,
|
||||
db_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
) -> mpsc::Receiver<WorkerMsg> {
|
||||
let (tx, rx) = mpsc::channel::<WorkerMsg>();
|
||||
std::thread::spawn(
|
||||
move || match ask_model(&question, &schema, &model, &prompt_file) {
|
||||
std::thread::spawn(move || {
|
||||
match ask_model_with_selection(
|
||||
&question,
|
||||
&schema,
|
||||
&model,
|
||||
&prompt_file,
|
||||
use_table_selection,
|
||||
&embeddings_file,
|
||||
&schema_json,
|
||||
similarity_threshold,
|
||||
) {
|
||||
Err(e) => {
|
||||
let err = format!("{:#}", e);
|
||||
log_question(&question, "", false, Some(&err));
|
||||
@@ -257,8 +277,8 @@ fn spawn_worker(
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
);
|
||||
}
|
||||
});
|
||||
rx
|
||||
}
|
||||
|
||||
@@ -270,6 +290,10 @@ fn spawn_retry_worker(
|
||||
model: String,
|
||||
prompt_file: String,
|
||||
db_file: String,
|
||||
use_table_selection: bool,
|
||||
embeddings_file: String,
|
||||
schema_json: String,
|
||||
similarity_threshold: f32,
|
||||
) -> mpsc::Receiver<WorkerMsg> {
|
||||
let retry_q = format!(
|
||||
"{}\n\nO SQL que você gerou falhou com este erro DuckDB:\n```\n{}\n```\n\n\
|
||||
@@ -277,7 +301,17 @@ fn spawn_retry_worker(
|
||||
Corrija o SQL. Retorne APENAS o SQL corrigido, sem explicação.",
|
||||
question, error, failed_sql
|
||||
);
|
||||
spawn_worker(retry_q, schema, model, prompt_file, db_file)
|
||||
spawn_worker(
|
||||
retry_q,
|
||||
schema,
|
||||
model,
|
||||
prompt_file,
|
||||
db_file,
|
||||
use_table_selection,
|
||||
embeddings_file,
|
||||
schema_json,
|
||||
similarity_threshold,
|
||||
)
|
||||
}
|
||||
|
||||
// ── event handling ────────────────────────────────────────────────────────────
|
||||
@@ -327,6 +361,10 @@ impl App {
|
||||
self.config.model.clone(),
|
||||
self.config.prompt_file.clone(),
|
||||
self.config.db_file.clone(),
|
||||
self.config.use_table_selection,
|
||||
self.config.embeddings_file.clone(),
|
||||
self.config.schema_json.clone(),
|
||||
self.config.similarity_threshold,
|
||||
));
|
||||
}
|
||||
|
||||
@@ -398,6 +436,10 @@ impl App {
|
||||
self.config.model.clone(),
|
||||
self.config.prompt_file.clone(),
|
||||
self.config.db_file.clone(),
|
||||
self.config.use_table_selection,
|
||||
self.config.embeddings_file.clone(),
|
||||
self.config.schema_json.clone(),
|
||||
self.config.similarity_threshold,
|
||||
));
|
||||
self.last_sql.clear();
|
||||
} else {
|
||||
@@ -723,7 +765,12 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
let col_max_widths: Vec<usize> = (0..col_count)
|
||||
.map(|i| {
|
||||
let header_len = cols[i].len();
|
||||
let data_len = rows.iter().filter_map(|r| r.get(i)).map(|c| c.len()).max().unwrap_or(0);
|
||||
let data_len = rows
|
||||
.iter()
|
||||
.filter_map(|r| r.get(i))
|
||||
.map(|c| c.len())
|
||||
.max()
|
||||
.unwrap_or(0);
|
||||
(header_len.max(data_len)).max(min_col_width as usize)
|
||||
})
|
||||
.collect();
|
||||
@@ -732,16 +779,24 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
let use_wrap = total_needed > available_width as usize;
|
||||
|
||||
if use_wrap {
|
||||
let wrap_width = (available_width as usize / col_count).max(min_col_width as usize);
|
||||
let header_lines: Vec<Line> = cols.iter()
|
||||
let wrap_width =
|
||||
(available_width as usize / col_count).max(min_col_width as usize);
|
||||
let header_lines: Vec<Line> = cols
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, c)| {
|
||||
let wrapped = wrap_text(c, wrap_width);
|
||||
Line::from(wrapped)
|
||||
let spans: Vec<Span> =
|
||||
wrapped.into_iter().map(|s| Span::raw(s)).collect();
|
||||
Line::from(spans)
|
||||
})
|
||||
.collect();
|
||||
|
||||
let max_header_lines = header_lines.iter().map(|l| l.len()).max().unwrap_or(1);
|
||||
let max_header_lines = header_lines
|
||||
.iter()
|
||||
.map(|l| l.spans.len())
|
||||
.max()
|
||||
.unwrap_or(1);
|
||||
|
||||
let mut all_row_lines: Vec<Vec<Line>> = Vec::new();
|
||||
for row in rows {
|
||||
@@ -749,19 +804,19 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
.map(|i| {
|
||||
let cell = row.get(i).map(|s| s.as_str()).unwrap_or("");
|
||||
let wrapped = wrap_text(cell, wrap_width);
|
||||
Line::from(wrapped)
|
||||
let spans: Vec<Span> =
|
||||
wrapped.into_iter().map(|s| Span::raw(s)).collect();
|
||||
Line::from(spans)
|
||||
})
|
||||
.collect();
|
||||
let max_lines = row_lines.iter().map(|l| l.len()).max().unwrap_or(1);
|
||||
let max_lines = row_lines.iter().map(|l| l.spans.len()).max().unwrap_or(1);
|
||||
all_row_lines.push(row_lines);
|
||||
}
|
||||
|
||||
let selected_idx = table_state.selected().unwrap_or(0);
|
||||
let table_title = format!(" Resultados ({}/{}) ", selected_idx + 1, n);
|
||||
|
||||
let block = Block::default()
|
||||
.borders(Borders::ALL)
|
||||
.title(table_title);
|
||||
let block = Block::default().borders(Borders::ALL).title(table_title);
|
||||
|
||||
let area = chunks[2];
|
||||
f.render_widget(block, area);
|
||||
@@ -778,29 +833,32 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
|
||||
let start_row = if n > visible_rows as usize {
|
||||
let scroll = selected_idx as i32 - visible_rows as i32 / 2;
|
||||
scroll.max(0) as usize.min(n.saturating_sub(visible_rows as usize))
|
||||
(scroll.max(0) as usize).min(n.saturating_sub(visible_rows as usize))
|
||||
} else {
|
||||
0
|
||||
};
|
||||
|
||||
let header_bg = Style::default().fg(Color::Yellow).add_modifier(Modifier::BOLD);
|
||||
let header_bg = Style::default()
|
||||
.fg(Color::Yellow)
|
||||
.add_modifier(Modifier::BOLD);
|
||||
for (col_idx, header_line) in header_lines.iter().enumerate() {
|
||||
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
|
||||
let col_width = wrap_width as u16;
|
||||
for (line_idx, line) in header_line.iter().enumerate() {
|
||||
for (line_idx, span) in header_line.spans.iter().enumerate() {
|
||||
let y = inner_area.y + line_idx as u16;
|
||||
if y >= inner_area.y + inner_area.height {
|
||||
break;
|
||||
}
|
||||
let spans: Vec<Span> = line.spans.iter().map(|s| {
|
||||
Span::styled(s.content.clone(), header_bg)
|
||||
}).collect();
|
||||
f.render_widget(Paragraph::new(Line::from(spans)), Rect {
|
||||
x: col_x,
|
||||
y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
});
|
||||
let styled_span = Span::styled(span.content.clone(), header_bg);
|
||||
f.render_widget(
|
||||
Paragraph::new(Line::from(styled_span)),
|
||||
Rect {
|
||||
x: col_x,
|
||||
y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
},
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -811,7 +869,9 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
}
|
||||
let is_selected = row_idx == selected_idx;
|
||||
let row_style = if is_selected {
|
||||
Style::default().bg(Color::DarkGray).add_modifier(Modifier::BOLD)
|
||||
Style::default()
|
||||
.bg(Color::DarkGray)
|
||||
.add_modifier(Modifier::BOLD)
|
||||
} else {
|
||||
Style::default()
|
||||
};
|
||||
@@ -820,20 +880,21 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
for (col_idx, cell_lines) in row_lines.iter().enumerate() {
|
||||
let col_x = inner_area.x + (col_idx as u16) * (wrap_width as u16 + 1);
|
||||
let col_width = wrap_width as u16;
|
||||
for (line_idx, line) in cell_lines.iter().enumerate() {
|
||||
for (line_idx, span) in cell_lines.spans.iter().enumerate() {
|
||||
let cell_y = y + line_idx as u16;
|
||||
if cell_y >= inner_area.y + inner_area.height {
|
||||
break;
|
||||
}
|
||||
let spans: Vec<Span> = line.spans.iter().map(|s| {
|
||||
Span::styled(s.content.clone(), row_style)
|
||||
}).collect();
|
||||
f.render_widget(Paragraph::new(Line::from(spans)), Rect {
|
||||
x: col_x,
|
||||
y: cell_y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
});
|
||||
let styled_span = Span::styled(span.content.clone(), row_style);
|
||||
f.render_widget(
|
||||
Paragraph::new(Line::from(styled_span)),
|
||||
Rect {
|
||||
x: col_x,
|
||||
y: cell_y,
|
||||
width: col_width,
|
||||
height: 1,
|
||||
},
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -850,7 +911,8 @@ fn draw_content(f: &mut Frame, app: &mut App, area: Rect) {
|
||||
}
|
||||
}
|
||||
} else {
|
||||
let col_widths: Vec<Constraint> = cols.iter()
|
||||
let col_widths: Vec<Constraint> = cols
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, _)| {
|
||||
let w = col_max_widths[i] as u16;
|
||||
@@ -1008,6 +1070,55 @@ fn ask_model(question: &str, schema: &str, model: &str, prompt_file: &str) -> Re
|
||||
Ok(ensure_sql(&sql))
|
||||
}
|
||||
|
||||
fn ask_model_with_selection(
|
||||
question: &str,
|
||||
_full_schema: &str,
|
||||
model: &str,
|
||||
prompt_file: &str,
|
||||
use_selection: bool,
|
||||
embeddings_file: &str,
|
||||
schema_json: &str,
|
||||
similarity_threshold: f32,
|
||||
) -> Result<String> {
|
||||
let prompt_template = fs::read_to_string(prompt_file)
|
||||
.with_context(|| format!("Não foi possível ler o prompt: {}", prompt_file))?;
|
||||
|
||||
let (schema_to_use, selected_tables) = if use_selection {
|
||||
match table_selector::select_tables_from_question(
|
||||
question,
|
||||
embeddings_file,
|
||||
similarity_threshold,
|
||||
) {
|
||||
Ok(table_ids) => {
|
||||
eprintln!(
|
||||
"=> Selecionadas {} tables relevantes: {:?}",
|
||||
table_ids.len(),
|
||||
table_ids
|
||||
);
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
let filtered_schema = schema_filter.filter_tables(&table_ids);
|
||||
(filtered_schema, Some(table_ids))
|
||||
}
|
||||
Err(e) => {
|
||||
eprintln!(
|
||||
"=> Aviso: falha na seleção de tables ({}), usando schema completo",
|
||||
e
|
||||
);
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
(schema_filter.full_schema_text(), None)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
let schema_filter = schema_filter::SchemaFilter::new(schema_json)?;
|
||||
(schema_filter.full_schema_text(), None)
|
||||
};
|
||||
|
||||
let generator = sql_generator::create_sql_generator()?;
|
||||
let sql = generator.generate(question, &schema_to_use, &prompt_template)?;
|
||||
|
||||
Ok(ensure_sql(&sql))
|
||||
}
|
||||
|
||||
fn ask_gemini(question: &str, system_prompt: &str, model: &str) -> Result<String> {
|
||||
let key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY não definida")?;
|
||||
let url = format!(
|
||||
@@ -1309,6 +1420,12 @@ VARIÁVEIS DE AMBIENTE
|
||||
OPENROUTER_API_KEY necessária para modelos OpenRouter
|
||||
GEMINI_MODEL modelo padrão (sobrescrito por --model)
|
||||
SCHEMA_FILE DDL do schema [context/schema_compact_inline.txt]
|
||||
SCHEMA_JSON full schema JSON [context/basedosdados-schema.json]
|
||||
EMBEDDINGS_FILE table embeddings [context/table_embeddings.json]
|
||||
TOP_K_TABLES número de tables a selecionar [5]
|
||||
SQL_GENERATOR sql generator: sqlcoder|gemini|openrouter [gemini]
|
||||
OLLAMA_MODEL modelo ollama [sqlcoder]
|
||||
OLLAMA_HOST host ollama [http://localhost:11434]
|
||||
PROMPT_FILE prompt do sistema [ask/system_prompt.md]
|
||||
DB_FILE banco DuckDB [basedosdados.duckdb]
|
||||
"#
|
||||
@@ -1321,7 +1438,18 @@ VARIÁVEIS DE AMBIENTE
|
||||
});
|
||||
let schema_file =
|
||||
env::var("SCHEMA_FILE").unwrap_or_else(|_| "context/schema_compact_inline.txt".into());
|
||||
let db_file = env::var("DB_FILE").unwrap_or_else(|_| "basedosdados.duckdb".into());
|
||||
let schema_json =
|
||||
env::var("SCHEMA_JSON").unwrap_or_else(|_| "context/basedosdados-schema.json".into());
|
||||
let embeddings_file =
|
||||
env::var("EMBEDDINGS_FILE").unwrap_or_else(|_| "context/table_embeddings.json".into());
|
||||
let similarity_threshold = env::var("SIMILARITY_THRESHOLD")
|
||||
.ok()
|
||||
.and_then(|v| v.parse().ok())
|
||||
.unwrap_or(0.35);
|
||||
let use_table_selection = env::var("USE_TABLE_SELECTION")
|
||||
.map(|v| v != "false" && v != "0")
|
||||
.unwrap_or(true);
|
||||
let db_file = env::var("DB_FILE").unwrap_or_else(|_| "data/basedosdados.duckdb".into());
|
||||
let prompt_file = env::var("PROMPT_FILE").unwrap_or_else(|_| "ask/system_prompt.md".into());
|
||||
let schema = fs::read_to_string(&schema_file)
|
||||
.with_context(|| format!("Não foi possível ler o schema: {}", schema_file))?;
|
||||
@@ -1333,6 +1461,10 @@ VARIÁVEIS DE AMBIENTE
|
||||
schema,
|
||||
db_file,
|
||||
prompt_file,
|
||||
use_table_selection,
|
||||
embeddings_file,
|
||||
schema_json,
|
||||
similarity_threshold,
|
||||
});
|
||||
}
|
||||
|
||||
@@ -1341,7 +1473,16 @@ VARIÁVEIS DE AMBIENTE
|
||||
eprintln!("\nModel: {}\nPergunta: {}\n", model, question);
|
||||
|
||||
let t0 = Instant::now();
|
||||
let sql = ask_model(&question, &schema, &model, &prompt_file)?;
|
||||
let sql = ask_model_with_selection(
|
||||
&question,
|
||||
&schema,
|
||||
&model,
|
||||
&prompt_file,
|
||||
use_table_selection,
|
||||
&embeddings_file,
|
||||
&schema_json,
|
||||
similarity_threshold,
|
||||
)?;
|
||||
eprintln!("=> SQL gerado em {}", fmt_duration(t0.elapsed()));
|
||||
print_sql_box(&sql);
|
||||
|
||||
|
||||
ask/src/schema_filter.rs | 135 (new file)
@@ -0,0 +1,135 @@
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::collections::HashSet;
|
||||
use std::fs;
|
||||
use std::path::Path;
|
||||
|
||||
#[derive(Debug, Clone, Deserialize, Serialize)]
|
||||
pub struct Column {
|
||||
pub name: String,
|
||||
#[serde(rename = "type")]
|
||||
pub col_type: String,
|
||||
pub description: Option<String>,
|
||||
}
|
||||
|
||||
pub type TableColumns = Vec<Column>;
|
||||
|
||||
#[derive(Debug, Clone, Deserialize)]
|
||||
pub struct FullSchema {
|
||||
#[serde(flatten)]
|
||||
pub datasets:
|
||||
std::collections::HashMap<String, std::collections::HashMap<String, TableColumns>>,
|
||||
}
|
||||
|
||||
pub struct SchemaFilter {
|
||||
schema: FullSchema,
|
||||
}
|
||||
|
||||
impl SchemaFilter {
|
||||
pub fn new<P: AsRef<Path>>(schema_path: P) -> anyhow::Result<Self> {
|
||||
let content = fs::read_to_string(schema_path)?;
|
||||
let schema: FullSchema = serde_json::from_str(&content)?;
|
||||
Ok(Self { schema })
|
||||
}
|
||||
|
||||
pub fn filter_tables(&self, table_ids: &[String]) -> String {
|
||||
let selected: HashSet<String> = table_ids.iter().cloned().collect();
|
||||
let mut lines = Vec::new();
|
||||
|
||||
lines.push("# Base dos Dados — Filtered Schema".to_string());
|
||||
lines.push(
|
||||
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
|
||||
.to_string(),
|
||||
);
|
||||
lines.push("# Format: dataset.table: col:TYPE description".to_string());
|
||||
lines.push(String::new());
|
||||
|
||||
for (dataset, tables) in &self.schema.datasets {
|
||||
for (table, columns) in tables {
|
||||
let full_id = format!("{}.{}", dataset, table);
|
||||
if selected.contains(&full_id) {
|
||||
let col_str = columns
|
||||
.iter()
|
||||
.map(|c| {
|
||||
let desc = c.description.as_deref().unwrap_or("");
|
||||
if desc.is_empty() {
|
||||
format!("{}:{}", c.name, type_abbrev(&c.col_type))
|
||||
} else {
|
||||
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
|
||||
}
|
||||
})
|
||||
.collect::<Vec<_>>()
|
||||
.join(" ");
|
||||
|
||||
lines.push(format!("{}: {}", full_id, col_str));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
lines.join("\n")
|
||||
}
|
||||
|
||||
pub fn full_schema_text(&self) -> String {
|
||||
let mut lines = Vec::new();
|
||||
|
||||
lines.push("# Base dos Dados — Full Schema".to_string());
|
||||
lines.push(
|
||||
"# Legend: V=VARCHAR I=INT D=DOUBLE Dt=DATE B=BOOLEAN Dec=DECIMAL Ts=TIMESTAMP Ti=TIME"
|
||||
.to_string(),
|
||||
);
|
||||
lines.push("# Format: dataset.table: col:TYPE description".to_string());
|
||||
lines.push(String::new());
|
||||
|
||||
for (dataset, tables) in &self.schema.datasets {
|
||||
for (table, columns) in tables {
|
||||
let full_id = format!("{}.{}", dataset, table);
|
||||
let col_str = columns
|
||||
.iter()
|
||||
.map(|c| {
|
||||
let desc = c.description.as_deref().unwrap_or("");
|
||||
if desc.is_empty() {
|
||||
format!("{}:{}", c.name, type_abbrev(&c.col_type))
|
||||
} else {
|
||||
format!("{}:{} {}", c.name, type_abbrev(&c.col_type), desc)
|
||||
}
|
||||
})
|
||||
.collect::<Vec<_>>()
|
||||
.join(" ");
|
||||
|
||||
lines.push(format!("{}: {}", full_id, col_str));
|
||||
}
|
||||
}
|
||||
|
||||
lines.join("\n")
|
||||
}
|
||||
|
||||
pub fn dataset_count(&self) -> usize {
|
||||
self.schema.datasets.len()
|
||||
}
|
||||
|
||||
pub fn table_count(&self) -> usize {
|
||||
self.schema.datasets.values().map(|t| t.len()).sum()
|
||||
}
|
||||
}
|
||||
|
||||
fn type_abbrev(full_type: &str) -> String {
|
||||
let upper = full_type.to_uppercase();
|
||||
if upper.contains("VARCHAR") || upper.contains("STRING") {
|
||||
"V".to_string()
|
||||
} else if upper.contains("INT") {
|
||||
"I".to_string()
|
||||
} else if upper.contains("DOUBLE") || upper.contains("FLOAT") {
|
||||
"D".to_string()
|
||||
} else if upper.contains("DATE") && !upper.contains("TIMESTAMP") {
|
||||
"Dt".to_string()
|
||||
} else if upper.contains("TIMESTAMP") {
|
||||
"Ts".to_string()
|
||||
} else if upper.contains("TIME") {
|
||||
"Ti".to_string()
|
||||
} else if upper.contains("BOOLEAN") {
|
||||
"B".to_string()
|
||||
} else if upper.contains("DECIMAL") {
|
||||
"Dec".to_string()
|
||||
} else {
|
||||
full_type.to_string()
|
||||
}
|
||||
}
|
||||
ask/src/sql_generator.rs | 207 (new file)
@@ -0,0 +1,207 @@
|
||||
use anyhow::{Context, Result};
|
||||
use serde_json::Value;
|
||||
use std::env;
|
||||
|
||||
pub trait SqlGenerator: Send + Sync {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String>;
|
||||
}
|
||||
|
||||
pub fn create_sql_generator() -> Result<Box<dyn SqlGenerator>> {
|
||||
let generator_type = env::var("SQL_GENERATOR").unwrap_or_else(|_| "gemini".to_string());
|
||||
|
||||
match generator_type.as_str() {
|
||||
"sqlcoder" => Ok(Box::new(SqlCoderGenerator::new()?)),
|
||||
"openrouter" => Ok(Box::new(OpenRouterGenerator::new()?)),
|
||||
"gemini" => Ok(Box::new(GeminiGenerator::new()?)),
|
||||
_ => anyhow::bail!(
|
||||
"Unknown SQL_GENERATOR: {}. Use: sqlcoder, gemini, or openrouter",
|
||||
generator_type
|
||||
),
|
||||
}
|
||||
}
|
||||
|
||||
pub struct GeminiGenerator {
|
||||
model: String,
|
||||
api_key: String,
|
||||
}
|
||||
|
||||
impl GeminiGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model = env::var("GEMINI_MODEL").unwrap_or_else(|_| "gemini-flash-latest".to_string());
|
||||
let api_key = env::var("GEMINI_API_KEY").context("GEMINI_API_KEY not defined")?;
|
||||
Ok(Self { model, api_key })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for GeminiGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = format!(
|
||||
"https://generativelanguage.googleapis.com/v1beta/models/{}:generateContent",
|
||||
self.model
|
||||
);
|
||||
|
||||
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"system_instruction": { "parts": [{ "text": system_prompt }] },
|
||||
"contents": [{ "parts": [{ "text": question }] }]
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(&url)
|
||||
.header("Content-Type", "application/json")
|
||||
.header("X-goog-api-key", &self.api_key)
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("Gemini HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse Gemini response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("Gemini API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["candidates"][0]["content"]["parts"][0]["text"]
|
||||
.as_str()
|
||||
.context("Unexpected Gemini response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
pub struct OpenRouterGenerator {
|
||||
model: String,
|
||||
api_key: String,
|
||||
}
|
||||
|
||||
impl OpenRouterGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model =
|
||||
env::var("OPENROUTER_MODEL").unwrap_or_else(|_| "openai/gpt-4o-mini".to_string());
|
||||
let api_key = env::var("OPENROUTER_API_KEY").context("OPENROUTER_API_KEY not defined")?;
|
||||
Ok(Self { model, api_key })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for OpenRouterGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = "https://openrouter.ai/api/v1/chat/completions";
|
||||
|
||||
let system_prompt = format!("{}\n\nSchema DDL:\n\n{}", prompt_template.trim(), schema);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"model": self.model,
|
||||
"messages": [
|
||||
{ "role": "system", "content": system_prompt },
|
||||
{ "role": "user", "content": question }
|
||||
]
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(url)
|
||||
.header("Content-Type", "application/json")
|
||||
.header("Authorization", format!("Bearer {}", self.api_key))
|
||||
.header("HTTP-Referer", "https://basedosdados.org")
|
||||
.header("X-Title", "Base dos Dados Ask")
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("OpenRouter HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse OpenRouter response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("OpenRouter API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["choices"][0]["message"]["content"]
|
||||
.as_str()
|
||||
.context("Unexpected OpenRouter response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
pub struct SqlCoderGenerator {
|
||||
model: String,
|
||||
host: String,
|
||||
}
|
||||
|
||||
impl SqlCoderGenerator {
|
||||
pub fn new() -> Result<Self> {
|
||||
let model = env::var("OLLAMA_MODEL").unwrap_or_else(|_| "sqlcoder".to_string());
|
||||
let host = env::var("OLLAMA_HOST").unwrap_or_else(|_| "http://localhost:11434".to_string());
|
||||
Ok(Self { model, host })
|
||||
}
|
||||
}
|
||||
|
||||
impl SqlGenerator for SqlCoderGenerator {
|
||||
fn generate(&self, question: &str, schema: &str, prompt_template: &str) -> Result<String> {
|
||||
let url = format!("{}/api/generate", self.host);
|
||||
|
||||
let full_prompt = format!(
|
||||
"{}\n\nSchema DDL:\n\n{}\n\nQuestion: {}\n\nSQL:",
|
||||
prompt_template.trim(),
|
||||
schema,
|
||||
question
|
||||
);
|
||||
|
||||
let payload = serde_json::json!({
|
||||
"model": self.model,
|
||||
"prompt": full_prompt,
|
||||
"stream": false
|
||||
});
|
||||
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.timeout(std::time::Duration::from_secs(300))
|
||||
.build()?;
|
||||
|
||||
let resp = client
|
||||
.post(&url)
|
||||
.header("Content-Type", "application/json")
|
||||
.json(&payload)
|
||||
.send()
|
||||
.context("Ollama HTTP request failed")?;
|
||||
|
||||
let status = resp.status();
|
||||
let body: Value = resp.json().context("Failed to parse Ollama response")?;
|
||||
|
||||
if !status.is_success() {
|
||||
anyhow::bail!("Ollama API error {}: {}", status, body);
|
||||
}
|
||||
|
||||
let text = body["response"]
|
||||
.as_str()
|
||||
.context("Unexpected Ollama response format")?
|
||||
.trim()
|
||||
.to_string();
|
||||
|
||||
Ok(strip_fences(&text))
|
||||
}
|
||||
}
|
||||
|
||||
fn strip_fences(text: &str) -> String {
|
||||
let text = text.trim();
|
||||
if text.starts_with("```sql") {
|
||||
let end = text.find("```").unwrap_or(text.len());
|
||||
text[5..end].trim().to_string()
|
||||
} else if text.starts_with("```") {
|
||||
let end = text[3..].find("```").map(|i| i + 3).unwrap_or(text.len());
|
||||
text[3..end].trim().to_string()
|
||||
} else {
|
||||
text.to_string()
|
||||
}
|
||||
}
|
||||
ask/src/table_selector.rs | 146 (new file)
@@ -0,0 +1,146 @@
|
||||
use serde::{Deserialize, Serialize};
|
||||
use std::fs;
|
||||
use std::path::Path;
|
||||
|
||||
const DEFAULT_SIMILARITY_THRESHOLD: f32 = 0.35;
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct TableEmbedding {
|
||||
pub id: String,
|
||||
pub text: String,
|
||||
pub embedding: Vec<f32>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct EmbeddingsIndex {
|
||||
pub tables: Vec<TableEmbedding>,
|
||||
pub model: String,
|
||||
}
|
||||
|
||||
pub struct TableSelector {
|
||||
tables: Vec<TableEmbedding>,
|
||||
threshold: f32,
|
||||
}
|
||||
|
||||
impl TableSelector {
|
||||
pub fn new<P: AsRef<Path>>(embeddings_path: P, threshold: f32) -> anyhow::Result<Self> {
|
||||
let content = fs::read_to_string(embeddings_path)?;
|
||||
let index: EmbeddingsIndex = serde_json::from_str(&content)?;
|
||||
Ok(Self {
|
||||
tables: index.tables,
|
||||
threshold,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn select_tables(
|
||||
&self,
|
||||
question: &str,
|
||||
model: &dyn QuestionEmbedder,
|
||||
) -> anyhow::Result<Vec<String>> {
|
||||
let question_embedding = model.embed(question)?;
|
||||
|
||||
let mut similarities: Vec<(usize, f32)> = self
|
||||
.tables
|
||||
.iter()
|
||||
.enumerate()
|
||||
.map(|(i, table)| {
|
||||
let sim = cosine_similarity(&question_embedding, &table.embedding);
|
||||
(i, sim)
|
||||
})
|
||||
.collect();
|
||||
|
||||
similarities.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||||
|
||||
let selected: Vec<String> = similarities
|
||||
.into_iter()
|
||||
.filter(|(_, sim)| *sim >= self.threshold)
|
||||
.map(|(i, sim)| {
|
||||
eprintln!(" {} (similarity: {:.3})", self.tables[i].id, sim);
|
||||
self.tables[i].id.clone()
|
||||
})
|
||||
.collect();
|
||||
|
||||
Ok(selected)
|
||||
}
|
||||
|
||||
pub fn get_table_texts(&self, table_ids: &[String]) -> Vec<String> {
|
||||
table_ids
|
||||
.iter()
|
||||
.filter_map(|id| self.tables.iter().find(|t| &t.id == id))
|
||||
.map(|t| t.text.clone())
|
||||
.collect()
|
||||
}
|
||||
|
||||
pub fn table_count(&self) -> usize {
|
||||
self.tables.len()
|
||||
}
|
||||
}
|
||||
|
||||
pub trait QuestionEmbedder: Send + Sync {
|
||||
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>>;
|
||||
}
|
||||
|
||||
pub struct LocalEmbedder {
|
||||
model_path: String,
|
||||
}
|
||||
|
||||
impl LocalEmbedder {
|
||||
pub fn new(model_path: String) -> Self {
|
||||
Self { model_path }
|
||||
}
|
||||
}
|
||||
|
||||
impl QuestionEmbedder for LocalEmbedder {
|
||||
fn embed(&self, text: &str) -> anyhow::Result<Vec<f32>> {
|
||||
use std::process::Command;
|
||||
|
||||
let output = Command::new("python3")
|
||||
.args([
|
||||
"-c",
|
||||
&format!(
|
||||
r#"
|
||||
import json
|
||||
from sentence_transformers import SentenceTransformer
|
||||
model = SentenceTransformer('{}')
|
||||
emb = model.encode('{}', convert_to_numpy=True)
|
||||
print(json.dumps([float(x) for x in emb]))
|
||||
"#,
|
||||
self.model_path,
|
||||
text.replace("'", "\\'")
|
||||
),
|
||||
])
|
||||
.output()?;
|
||||
|
||||
if !output.status.success() {
|
||||
let err = String::from_utf8_lossy(&output.stderr);
|
||||
anyhow::bail!("Embedding generation failed: {}", err);
|
||||
}
|
||||
|
||||
let output_str = String::from_utf8_lossy(&output.stdout);
|
||||
let floats: Vec<f32> = serde_json::from_str(&output_str)?;
|
||||
|
||||
Ok(floats)
|
||||
}
|
||||
}
|
||||
|
||||
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
|
||||
let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
|
||||
let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
|
||||
let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
|
||||
|
||||
if norm_a == 0.0 || norm_b == 0.0 {
|
||||
0.0
|
||||
} else {
|
||||
dot_product / (norm_a * norm_b)
|
||||
}
|
||||
}
|
||||
|
||||
pub fn select_tables_from_question(
|
||||
question: &str,
|
||||
embeddings_path: &str,
|
||||
threshold: f32,
|
||||
) -> anyhow::Result<Vec<String>> {
|
||||
let selector = TableSelector::new(embeddings_path, threshold)?;
|
||||
let embedder = LocalEmbedder::new("all-MiniLM-L6-v2".to_string());
|
||||
selector.select_tables(question, &embedder)
|
||||
}
|
||||
ask/system_prompt.md

@@ -147,3 +147,68 @@ LIMIT 30
if the question requires tables not in the provided DDL, OR
if you can't generate a valid SQL,
answer as a JSON {error: "#{reason}"}


## Common SQL Pitfalls & Debugging Strategy

### 1. Column Propagation in CTEs (Most Common Error!)
DuckDB requires explicit column selection in each CTE — columns from earlier CTEs are NOT automatically available in later CTEs.

WRONG — `pop_2010` was not selected in the `populacao` CTE:
```sql
WITH populacao AS (
  SELECT id_municipio, sigla_uf        -- forgot pop_2010
),
fluxo AS (
  SELECT p.pop_2010                    -- error: pop_2010 not in p
)
```

CORRECT — Select all columns needed in subsequent CTEs:
```sql
WITH populacao AS (
  SELECT id_municipio, sigla_uf, pop_2010, pop_2022   -- explicit
),
fluxo AS (
  SELECT p.pop_2010                    -- works
)
```

### 2. ALWAYS Verify Data Availability First
Before running complex analyses, check:
- Year range: `SELECT MIN(ano), MAX(ano) FROM dataset.table`
- Record count: `SELECT COUNT(*) FROM dataset.table`
- ID format compatibility between tables before JOIN

### 3. Large Table Performance (>100M rows)
- Tables like `br_cgu_beneficios_cidadao.novo_bolsa_familia` (588M+ records) WILL timeout
- Strategy: Aggregate first with WHERE filters, then join (see the sketch below)
- Use `LIMIT` when exploring to avoid long scans
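A minimal sketch of the aggregate-first strategy; the column names (`ano`, `sigla_uf`, `id_municipio`, `valor`) and the join to `br_bd_diretorios_brasil.municipio` are illustrative assumptions and should be checked against the provided DDL before use:

```sql
-- Aggregate the huge table with filters first, then join the small result.
WITH beneficios AS (
  SELECT id_municipio, SUM(valor) AS total_pago
  FROM br_cgu_beneficios_cidadao.novo_bolsa_familia
  WHERE ano = 2023 AND sigla_uf = 'SP'      -- filter BEFORE aggregating
  GROUP BY id_municipio
)
SELECT m.nome, b.total_pago
FROM beneficios b
JOIN br_bd_diretorios_brasil.municipio m USING (id_municipio)
ORDER BY b.total_pago DESC
LIMIT 20;
```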
### 4. Lock Conflicts
Multiple concurrent DuckDB queries on the same `.duckdb` file cause lock errors.
- Wait between queries or use read-only mode

### 5. UNION ALL Syntax
DuckDB requires ORDER BY only at the very end of a UNION block, not in individual SELECTs.

WRONG:
```sql
SELECT ... LIMIT 5
ORDER BY x
UNION ALL
SELECT ... LIMIT 5
ORDER BY y   -- error
```

CORRECT — Use subqueries or CTEs:
```sql
SELECT * FROM (SELECT ... ORDER BY x LIMIT 5) a
UNION ALL
SELECT * FROM (SELECT ... ORDER BY y LIMIT 5) b
```

### 6. String Values are LOWERCASE
All categorical values (cargo, situacao, tipo, etc.) are stored in lowercase.
Always use: `WHERE cargo = 'deputado federal'`, not `'DEPUTADO FEDERAL'`.
Binary file not shown.
context/basedosdados-schema.json | 2075 (new file)
File diff suppressed because one or more lines are too long

context/table_embeddings.json | 298355 (new file)
File diff suppressed because one or more lines are too long
data/basedosdados.duckdb | BIN (new file)
Binary file not shown.
docs/dataset_embeds.md | 59 (new file)
@@ -0,0 +1,59 @@
## Goal

Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.

## Instructions

- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
- Use a similarity threshold (default 0.35) instead of a fixed top-k to dynamically select tables (see the formula after this list)
- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs
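Written out, the threshold rule is a similarity cut rather than a fixed top-k: with $q$ the embedding of the question, $e_t$ the embedding of table $t$, and $\tau$ the threshold (0.35 by default), the selected set is

$$
\cos(q, e_t) = \frac{q \cdot e_t}{\lVert q \rVert\,\lVert e_t \rVert},
\qquad
S = \{\, t \mid \cos(q, e_t) \ge \tau \,\}
$$

so the number of tables sent to the LLM grows or shrinks with how specific the question is, which is the behavior the threshold-tuning note below describes.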
## Discoveries

- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
- **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
- **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL
- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
- **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

## Accomplished

1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

## Relevant files / directories

### Created/Modified
- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names

### Configuration (env vars)
- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
- `SIMILARITY_THRESHOLD` - 0.35 default
- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
- `EMBEDDINGS_FILE`, `SCHEMA_JSON`

## Next Steps

- Increase similarity threshold (try 0.45) to reduce table count
- Improve sqlcoder prompt for better SQL generation
- Add fallback to increase threshold if too many tables selected
- Consider keyword matching as backup if embeddings fail
docs/patterns-audit.md | 299 (new file)
@@ -0,0 +1,299 @@
# Pattern Audit — Robustness & False Positive Analysis

Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.

---

## US1 — Split Contracts Below Threshold (`split_contracts_below_threshold`)

### Legal basis
**Fracionamento de licitação** is prohibited by:
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.

### Threshold: year-dependent

| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |

For 2023 data many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$57.912. Using a static R$17.600 for 2024+ data would miss the main fraud window (R$17k–R$57k per contract). **Fixed (iteration 7):** all three implementations compute the threshold from the query year.
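To make the detection rule concrete, here is a minimal BigQuery-style sketch. It keeps the column names used throughout this audit (`id_orgao_superior`, `data_assinatura_contrato`, `valor_inicial_compra`), but the table name `contratos`, the supplier column `cnpj_contratada`, and the exact grouping are illustrative assumptions, not the production query:

```sql
-- Sketch: flag (ministry, supplier, month) groups with >= 3 sub-threshold contracts
-- whose combined value exceeds the year-dependent threshold.
WITH base AS (
  SELECT
    id_orgao_superior,
    cnpj_contratada,                                   -- assumed supplier column
    FORMAT_DATE('%Y-%m', data_assinatura_contrato) AS mes,
    valor_inicial_compra,
    IF(EXTRACT(YEAR FROM data_assinatura_contrato) >= 2024, 57912, 17600) AS threshold
  FROM contratos                                       -- stand-in for the real source table
  WHERE valor_inicial_compra > 0
    AND data_assinatura_contrato IS NOT NULL           -- avoid a spurious NULL-month bucket
)
SELECT
  id_orgao_superior, cnpj_contratada, mes,
  COUNT(*) AS n_contratos,
  SUM(valor_inicial_compra) AS valor_combinado
FROM base
WHERE valor_inicial_compra < threshold
GROUP BY id_orgao_superior, cnpj_contratada, mes
HAVING COUNT(*) >= 3 AND SUM(valor_inicial_compra) > MAX(threshold);
```

The `EXTRACT(YEAR ...)` branch is what iteration 7 refers to; the other guards mirror the ones discussed in the rest of this section.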
### False positive scenarios
1. **Legitimate multi-item purchasing**: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The `combined_value > threshold` guard reduces but doesn't eliminate this.
2. **Recurring service contracts**: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year — correctly NOT flagged (combined = R$18.000 > threshold, count ≥ 3 in first 3 months).
3. **Different sub-units**: The grouping uses `id_orgao_superior` (ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.

### Improvements applied
- None structural. Filter `valor_inicial_compra > 0` prevents division issues.

### Known data quality issues
- `data_assinatura_contrato` can be NULL for some older contracts. **`FORMAT_DATE` on NULL returns NULL — it does NOT exclude those rows.** Without a guard, all NULL-dated contracts from the same agency would be grouped together under a single `NULL` month bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now include `AND data_assinatura_contrato IS NOT NULL` in the WHERE clause.
- `valor_inicial_compra` vs `valor_final_compra`: we use `valor_inicial_compra` intentionally, since splitting is defined by the contract as signed, not as finalized.

### Improvements applied (iteration 5)
- Added `AND data_assinatura_contrato IS NOT NULL` to the WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spurious `mes = NULL` bucket.

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): `scan-all.ts` now includes `id_orgao_superior` in both SELECT and GROUP BY, matching `index.ts` and `scan-suspicious.ts`. Prevents theoretical merging of two distinct ministries sharing the same name.

---

## US2 — Contract Concentration (`contract_concentration`)

### Legal basis
No specific legal prohibition, but **TCU** and **CGU** audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.

### Thresholds
- **40% share**: empirical; above this, competition is functionally absent for that agency (see the sketch after this list).
- **R$ 50.000 minimum agency total**: excludes micro-units (small local offices) where one purchase naturally dominates.
- **R$ 10.000 minimum supplier spend** (new, iteration 2): excludes trivial cases like a company with R$21k of a R$50k agency = 42%, where both numbers are small.
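As a reference for how the three cutoffs interact, a simplified sketch; the `contratos` table and `cnpj_contratada` column are illustrative placeholders, and the real queries also carry `nome_orgao_superior` in the grouping, as noted in the consistency item below:

```sql
-- Sketch: each supplier's share of a ministry's total spend, with all three guards.
WITH spend AS (
  SELECT id_orgao_superior, cnpj_contratada,
         SUM(valor_inicial_compra) AS supplier_spend
  FROM contratos
  GROUP BY id_orgao_superior, cnpj_contratada
),
ministry_total AS (
  SELECT id_orgao_superior, SUM(supplier_spend) AS agency_total
  FROM spend
  GROUP BY id_orgao_superior
)
SELECT s.id_orgao_superior, s.cnpj_contratada,
       ROUND(s.supplier_spend / t.agency_total, 3) AS share
FROM spend s
JOIN ministry_total t USING (id_orgao_superior)
WHERE t.agency_total   >= 50000    -- minimum agency total
  AND s.supplier_spend >= 10000    -- CONCENTRATION_MIN_SUPPLIER_SPEND
  AND s.supplier_spend / t.agency_total > 0.40;
```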
### False positive scenarios
1. **Specialized niches**: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
2. **Monopolistic markets**: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
3. **Framework agreements**: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.

### Improvements applied
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000` to batch query and `scan-suspicious.ts` (iteration 2).
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND` filter to `index.ts` `patternConcentration` HAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): `index.ts` HAVING clause now includes `supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND`.
✅ Fixed (iteration 9): `scan-all.ts` and `scan-suspicious.ts` now group by `(id_orgao_superior, nome_orgao_superior)` in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.

---

## US3 — Inexigibility Recurrence (`inexigibility_recurrence`)

### Legal basis
**Inexigibilidade de licitação** (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
- Reference: **TCU Acórdão 1.793/2011**: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.

### Threshold: 3 contracts per managing unit
- Below 3: could be two legitimate sole-source needs in the same year.
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.

### False positive scenarios
1. **Legitimate exclusive suppliers**: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
2. **Long-term technical partnerships**: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
3. **Artistic/cultural organizations**: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.

### Improvements applied (iteration 2)
- **Batch + scan-suspicious**: Now groups by `id_unidade_gestora` (ID) + `nome_unidade_gestora` (name). Previously grouped by name only, risking merger of distinct units sharing a common name.
- **Batch + scan-suspicious**: Added `valor_inicial_compra >= R$ 1.000` filter. Micro-value contracts (< R$1k) rarely represent real abuse.

### Improvements applied (iteration 4)
- **`index.ts`**: Added `AND valor_inicial_compra >= @min_value` to WHERE clause of `patternInexigibility`. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.

### Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): all three implementations now filter `valor_inicial_compra >= R$ 1.000` and group by `id_unidade_gestora`.

---

## US4 — Single Bidder (`single_bidder`)

### Legal basis

Not inherently illegal, but flagged by:

- **Open Contracting Partnership "73 Red Flags" (2024)**, Flag #1: "Only one bid received."
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.

### Threshold: 2 occurrences

- Intentionally low. Even one solo-bid win warrants investigation context. Two is the minimum pattern.

### False positive scenarios

1. **Specialized markets**: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
2. **Geographic isolation**: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
3. **Poorly timed notices**: Short bid windows or holiday periods reduce participation regardless of market structure.

### SQL robustness notes

- Per-CNPJ: uses `STARTS_WITH(REGEXP_REPLACE(...), @cnpj)` — this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface.
- Batch: uses `MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))` to extract the winner's CNPJ from the `auction_stats` CTE. The `LENGTH = 14` guard in the `IF` condition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows have `vencedor=true` for the same auction (data quality issue), `MAX` picks lexicographically last — acceptable for batch purposes.

### Per-CNPJ vs batch consistency

✅ Fixed (iteration 8): **batch now counts ALL participants** (CPF + CNPJ) for `total_bidders`, matching per-CNPJ behavior. Previously, `LENGTH = 14` excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The `LENGTH = 14` guard is now applied only inside the `winner_cnpj` extraction `IF()` condition — not to the overall participant count.
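
A compact sketch of that aggregation shape (the real `auction_stats` CTE is larger; `participantes` and its columns are stand-ins):

```typescript
// Hypothetical aggregate per auction: count every participant (CPF or CNPJ),
// but only extract an 8-digit winner key when the winner is a 14-digit CNPJ.
const auctionStats = `
  SELECT id_licitacao,
         COUNT(*) AS total_bidders,
         MAX(IF(vencedor AND LENGTH(doc) = 14, SUBSTR(doc, 1, 8), NULL)) AS winner_cnpj
  FROM (
    SELECT id_licitacao, vencedor,
           REGEXP_REPLACE(cpf_cnpj_participante, r'\\D', '') AS doc
    FROM participantes
  )
  GROUP BY id_licitacao
`;
```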

---

## US5 — Always Winner (`always_winner`)

### Legal basis

Not illegal per se, but high win rates in competitive auctions indicate possible:

- Bid rigging (Lei 12.529/2011 art. 36, IV)
- Tailored specifications (Lei 14.133/2021 art. 9, I)
- Reference: **OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)**

### Thresholds

- **≥80% win rate** (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
- **≥10 competitive participations** — minimum sample for statistical significance. Aligns batch and per-CNPJ.
- **Competitive auctions only (≥2 bidders)** — critical to avoid overlap with US4.

### Critical fix applied (iteration 2)

**The per-CNPJ version was NOT filtering for competitive auctions before this iteration.** A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a `competitive_auctions` CTE that filters `COUNT(1) >= 2`.
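
A minimal sketch of that fix; the surrounding query and the `participantes` table name are assumed, only the `COUNT(1) >= 2` idea comes from the spec:

```typescript
// Hypothetical shape: restrict the win-rate calculation to auctions that had
// at least two bidders, so solo-bid wins are handled by US4 only.
const competitiveWinRate = `
  WITH competitive_auctions AS (
    SELECT id_licitacao
    FROM participantes
    GROUP BY id_licitacao
    HAVING COUNT(1) >= 2
  )
  SELECT p.cnpj_basico,
         COUNT(*) AS participations,
         COUNTIF(p.vencedor) AS wins,
         SAFE_DIVIDE(COUNTIF(p.vencedor), COUNT(*)) AS win_rate
  FROM participantes p
  JOIN competitive_auctions c USING (id_licitacao)
  GROUP BY p.cnpj_basico
  HAVING COUNT(*) >= 10 AND SAFE_DIVIDE(COUNTIF(p.vencedor), COUNT(*)) >= 0.80
`;
```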

### Win rate distribution note

The `licitacao_participante` dataset is **strongly bimodal**: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only **perfect-win companies** — intentionally strict. This is documented in the spec.
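
For reference, a dynamic Q3 cutoff of this kind can be computed in BigQuery with `APPROX_QUANTILES`; the snippet is a sketch over an assumed `win_rates` CTE, not the actual batch query:

```typescript
// Hypothetical: third quartile of per-company win rates among companies with
// >= 10 competitive participations. With the bimodal distribution described
// above, this evaluates to roughly 1.0.
const q3Threshold = `
  SELECT APPROX_QUANTILES(win_rate, 4)[OFFSET(3)] AS q3
  FROM win_rates
  WHERE participations >= 10
`;
```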

### Per-CNPJ vs batch consistency

✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.

---

## US6 — Amendment Inflation (`amendment_inflation`)

### Legal basis

**Lei 14.133/2021 art. 125 §1º**: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract **reached or exceeded its legal ceiling**.

### Threshold: 1.25× (25% above original)

- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).

### False positive scenarios

1. **Lawful exceptional amendments**: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
2. **Construction contracts**: Legal ceiling is 50% (not 25%). Our threshold of 1.25× flags construction contracts that are within the legal limit.
3. **Value adjustment clauses**: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
4. **Data entry errors**: Some `valor_final_compra` values are clearly data quality issues (e.g., 100× original).

### Improvements applied (iteration 3)

- **Cap `inflation_ratio` at 10×** (`AMENDMENT_MAX_INFLATION_RATIO = 10.0`): ratios above this threshold are almost certainly data entry errors (e.g., `valor_final_compra` entered in a different unit) and would distort `total_excess` reporting. Applied to all three implementations via `AND ... <= @max_ratio` filter in SQL. Applied in `index.ts`, `scan-all.ts`, `scan-suspicious.ts`.

### Schema verification: construction vs goods/services threshold

Lei 14.133/2021 art. 125 §1º allows 50% amendments for engineering works vs 25% for goods/services.

**Column verified (schema dump):** `contrato_compra` has `id_modalidade_licitacao` (code) and `modalidade_licitacao` (name). However, this column encodes **bidding modality** (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no `tipo_contrato` or `categoria` column in the accessible schema.

### Improvements applied (iteration 8): construction keyword detection

All three implementations now apply `IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25)` to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.

**Keywords and rationale:**

| Keyword | Matches | Rationale |
|---------|---------|-----------|
| `obra` | obra, obras | General construction work |
| `constru` | construção, construir | Building/construction |
| `reform` | reforma, reformar, reformas | Renovation/remodeling |
| `engenhari` | engenharia, engenheiro | Engineering services |
| `paviment` | pavimentação, pavimento | Road/floor paving |
| `demoli` | demolição, demolir | Demolition |

**Known limitations:** The `objeto` field is free text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (a conservative false positive versus a missed construction exemption).
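
Putting the iteration 3 cap and the iteration 8 keyword detection together, the per-contract check has roughly this shape (a sketch only; the `contratos` table and alias names are assumptions):

```typescript
// Hypothetical per-contract filter: pick the legal ceiling from the free-text
// "objeto", then keep contracts at or above that ceiling, discarding ratios
// above 10x as likely data entry errors.
const AMENDMENT_MAX_INFLATION_RATIO = 10.0;

const amendmentFilter = `
  SELECT * FROM (
    SELECT *,
           SAFE_DIVIDE(valor_final_compra, valor_inicial_compra) AS inflation_ratio,
           IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')),
                              r'obra|constru|reform|engenhari|paviment|demoli'),
              1.50, 1.25) AS legal_ceiling
    FROM contratos
  )
  WHERE inflation_ratio >= legal_ceiling
    AND inflation_ratio <= ${AMENDMENT_MAX_INFLATION_RATIO}
`;
```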

### Improvements applied (iteration 9): constructionCount field

`AmendmentInflationFlag` now includes `constructionCount`: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."

### Per-CNPJ vs batch consistency

⚠️ Minor divergence (accepted): `index.ts` includes the aditivos CTE (`zeroAmendmentCount`) and `constructionCount` from `is_construction`. The batch scanners do NOT include these: a full scan of `contrato_termo_aditivo` is too expensive in batch, and `constructionCount` is per-row information that cannot be aggregated without the row-level data. Both fields are only available in the web UI's per-CNPJ output.

---

## US7 — Newborn Company (`newborn_company`)

### Legal basis

No specific prohibition, but:

- **Lei 14.133/2021 art. 68, I**: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of a possible shell company.

### Thresholds

- **180 days** (6 months): practical minimum for legitimate operational readiness.
- **R$ 50.000 minimum contract value**: excludes training contracts and small acquisitions where new companies are common and low-risk (see the sketch below).
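
A minimal sketch of the age test these thresholds imply. Table and column names (`estabelecimentos`, `contratos`, `data_assinatura`) are illustrative stand-ins for the real BigQuery tables, not the exact batch SQL:

```typescript
// Hypothetical newborn check: earliest establishment opening date per
// cnpj_basico, compared with the signing date of the company's first
// contract of at least R$ 50.000.
const NEWBORN_MAX_AGE_DAYS = 180;
const NEWBORN_MIN_VALUE = 50_000;

const newbornQuery = `
  WITH founding AS (
    SELECT cnpj_basico, MIN(data_inicio_atividade) AS fundacao
    FROM estabelecimentos
    WHERE ano = 2023 AND mes = 12
    GROUP BY cnpj_basico
  ),
  first_contract AS (
    SELECT cnpj_basico, MIN(data_assinatura) AS primeiro_contrato
    FROM contratos
    WHERE valor_final_compra >= ${NEWBORN_MIN_VALUE}
    GROUP BY cnpj_basico
  )
  SELECT f.cnpj_basico, f.fundacao, c.primeiro_contrato,
         DATE_DIFF(c.primeiro_contrato, f.fundacao, DAY) AS idade_dias
  FROM first_contract c
  JOIN founding f USING (cnpj_basico)
  WHERE DATE_DIFF(c.primeiro_contrato, f.fundacao, DAY)
        BETWEEN 0 AND ${NEWBORN_MAX_AGE_DAYS}
`;
```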

### False positive scenarios

1. **Spinoffs and restructurings**: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
2. **Holding company structures**: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
3. **Startups in innovation programs**: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
4. **`data_inicio_atividade` from establishments**: The founding date comes from `br_me_cnpj.estabelecimentos`, not `empresas`. Branches opened after the headquarters can make an established company appear "newborn" in a specific municipality.

### Data quality note

`data_inicio_atividade` lives in `br_me_cnpj.estabelecimentos`, NOT `empresas`. The query uses `MIN(est.data_inicio_atividade)` across all establishments for the same `cnpj_basico` — this correctly picks the earliest known opening date, reducing the branch-related false positives.

### Per-CNPJ vs batch consistency

✅ Equivalent. Both use `MIN(data_inicio_atividade)` across establishments with `ano=2023 AND mes=12`.

⚠️ **Known necessary full-table scan**: The `first_contract` CTE in `batchNewborn` (`scan-all.ts`) intentionally omits an `ano` filter on `contrato_compra`:

```sql
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
  AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
```

This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: *"did this company win its very first contract within 180 days of founding?"* Restricting to `ano = ANO` would miss the true first contract if it occurred in an earlier year — producing a false negative. The `founding` CTE correctly filters `e.ano = ANO AND est.ano = ANO AND est.mes = 12`. Only `first_contract` scans all years, but the `LENGTH = 14` CPF exclusion and the `valor_final_compra >= R$ 50k` filter significantly reduce bytes scanned.

---

## US8 — Sudden Surge (`sudden_surge`)

### Legal basis

Not illegal, but flagged by:

- **UNODC "Guidebook on anti-corruption in public procurement" (2013)**: "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.

### Thresholds

- **5× YoY growth**: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- **R$ 1.000.000 minimum**: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- **4-year lookback**: captures context before the surge.

### False positive scenarios

1. **Post-restructuring recovery**: A company that was inactive for 2 years then resumed full operations would appear to surge.
2. **New framework agreements**: Being added to a large framework agreement in year N can produce an apparent surge with no underlying change in the company.
3. **Government budget cycles**: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements), creating apparent surges.

### SQL robustness note

Both per-CNPJ and batch use a `prev_v > 0` guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses the `LAG` window function; per-CNPJ iterates over the history array client-side.

**Consecutive-year guard (iteration 6):** The spec says `value[year_N] / value[year_N-1]`. Without a guard, `LAG` compares any adjacent rows in sorted order — if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:

- `scan-all.ts`: added `LAG(ano)` alongside `LAG(v)` and `WHERE ano - prev_ano = 1`
- `index.ts`, `scan-suspicious.ts`: added `curr.ano - prev.ano === 1` to the JS loop condition (see the sketch below)
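
Both variants of the guard fit in a few lines; the snippet below is a sketch with assumed shapes for the yearly-history rows, not the literal code in `scan-all.ts` or `index.ts`:

```typescript
// Batch (SQL): carry the previous year alongside the previous value, and only
// compare strictly consecutive years.
const surgeLag = `
  SELECT cnpj_basico, ano, v,
         LAG(v)   OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_v,
         LAG(ano) OVER (PARTITION BY cnpj_basico ORDER BY ano) AS prev_ano
  FROM yearly_totals
  QUALIFY prev_v > 0 AND ano - prev_ano = 1 AND v / prev_v >= 5 AND v >= 1000000
`;

// Per-CNPJ (client side): the same rule over a sorted history array.
interface YearTotal { ano: number; v: number; }

function firstSurge(history: YearTotal[]): YearTotal | null {
  const sorted = [...history].sort((a, b) => a.ano - b.ano);
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    if (prev.v > 0 && curr.ano - prev.ano === 1 &&
        curr.v / prev.v >= 5 && curr.v >= 1_000_000) {
      return curr; // report only the first qualifying surge year
    }
  }
  return null;
}
```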

**Effect on false positive #1:** The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger, since the consecutive-year guard skips companies that were dormant for a year or more.

The per-CNPJ implementation reports only the **first** qualifying surge year (it breaks on the first hit). If a company surged twice, only the earlier event is shown. This is conservative.

### Per-CNPJ vs batch consistency

✅ Equivalent. Batch uses SQL `LAG`; per-CNPJ uses a JS loop. Both find the first qualifying year.

---

## Infrastructure Issue: Cache Miss vs Stored Null

### Bug 1: Cache Miss vs Stored Null (fixed iteration 6)

`cache.ts` `getCache` was returning `null` for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use `null` as their "nothing found" sentinel and check `cached !== undefined` to skip re-querying. With the old `getCache` returning `null` on miss, `null !== undefined` evaluated to `true`, causing the BigQuery query to be skipped permanently — US4–US8 would never execute on a CNPJ not yet in cache.

**Fix:** `getCache` now returns `undefined` on miss or expiry; returns `T` (including `null`) on a valid cache hit. The company-lookup caller that used `!== null` was updated to `!== undefined`.
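
A sketch of the corrected contract (a file-per-key JSON cache is assumed; the real `cache.ts` may differ in storage details):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Hypothetical cache entry layout: stored value plus an expiry timestamp.
interface CacheEntry<T> { value: T; expiresAt: number; }

// Returns undefined on miss or expiry; returns the stored value (which may
// legitimately be null) on a hit, so callers can distinguish the two cases.
function getCache<T>(path: string): T | undefined {
  if (!existsSync(path)) return undefined;
  const entry = JSON.parse(readFileSync(path, "utf8")) as CacheEntry<T>;
  if (Date.now() > entry.expiresAt) return undefined;
  return entry.value;
}
```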

### Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)

US1, US2, US3, and `runPatterns()` in `index.ts` used `if (cached) return cached` to check for cache hits. A plain truthiness check cannot distinguish a cache miss (`undefined`) from a legitimately cached falsy "nothing found" value, so a cached clean result could be silently discarded, causing BigQuery to be re-queried on every subsequent call for clean CNPJs.

Affected: `patternSplitContracts`, `patternConcentration`, `patternInexigibility`, `runPatterns`.

**Fix:** changed all four to `if (cached !== undefined) return cached`. (US4–US8 already used this pattern since they cache `null` as "nothing found" — they were correct.)
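
A before/after sketch of the corrected check; the helpers and the `Flag` type are stubs standing in for the real code:

```typescript
type Flag = Record<string, unknown>;

// Stubs for the real cache and BigQuery helpers (assumptions, not the real API).
declare function getCache<T>(path: string): T | undefined;
declare function setCache<T>(path: string, value: T): void;
declare function queryBigQuery(cnpj: string): Promise<Flag[]>;

// Before: `if (cached) return cached` also fell through on cached falsy values.
// After (iteration 7): only an actual cache miss reaches BigQuery; a cached
// empty array still counts as a hit.
async function patternConcentration(cnpj: string): Promise<Flag[]> {
  const cached = getCache<Flag[]>(`cache/concentration-${cnpj}.json`);
  if (cached !== undefined) return cached;
  const flags = await queryBigQuery(cnpj);
  setCache(`cache/concentration-${cnpj}.json`, flags);
  return flags;
}
```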

---

## Cross-Pattern Issues

### Overlap between US4 and US5

- **Before iteration 2**: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- **After iteration 2**: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.

### Overlap between US7 and US8

- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive — both signals reinforce each other.

### CNPJ matching strategy

All patterns use `cnpj_basico` (8-digit root) as the joining key. This means **all branches and subsidiaries** of a corporate group are attributed to the same `cnpj_basico`. This can create false positives for large corporations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.
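
For reference, `cnpj_basico` is simply the first 8 digits of the full 14-digit CNPJ; a sketch of the normalization used as the join key (the function name is illustrative):

```typescript
// Strip formatting and keep the 8-digit corporate root shared by all branches.
// e.g. "12.345.678/0001-00" -> "12345678" (a branch "12.345.678/0002-00" maps
// to the same root). Returns null for CPFs and malformed values.
function cnpjBasico(cnpj: string): string | null {
  const digits = cnpj.replace(/\D/g, "");
  return digits.length === 14 ? digits.slice(0, 8) : null;
}
```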

---

## Summary Table

| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---------|---------|------------|---------------|
| US1 Split | Medium — multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; **batch GROUP BY now includes id_orgao_superior** |
| US2 Concentration | Medium — specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; **falsy cache check fixed**; **all 3 now GROUP BY (id+name) — no ministry-name collision** |
| US3 Inexigibility | High — legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; **falsy cache check fixed** |
| US4 Single Bidder | Medium — specialized/remote markets | OCP 2024 Flag #1 | **cache.ts bug fixed** (getCache null-vs-undefined); **batch now counts all participants (CPF+CNPJ)** — consistent with per-CNPJ |
| US5 Always Winner | **Was HIGH** (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; **cache.ts bug fixed** |
| US6 Amendment | Medium — inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; **cache.ts bug fixed**; **construction keyword detection: 1.50× threshold for obras/etc.**; **constructionCount in UI flag** |
| US7 Newborn | High — spinoffs, restructurings | CGU 2021 guide | **cache.ts bug fixed** (was never querying BigQuery on cache miss) |
| US8 Surge | Medium — framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; **cache.ts bug fixed** |

BIN
docs/wordcloud_attributes.png
Normal file
BIN
docs/wordcloud_attributes.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 1.8 MiB |
45
docs/wordcloud_attributes.py
Normal file
45
docs/wordcloud_attributes.py
Normal file
@@ -0,0 +1,45 @@
|
||||
#!/usr/bin/env python3
|
||||
import json
|
||||
import re
|
||||
from collections import Counter
|
||||
from wordcloud import WordCloud
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
STOPWORDS = {'de', 'do', 'da', 'a', 'ou', 'em', 'e', 'o', 'que', 'das', 'dos', 'nos', 'nas', 'um', 'uma', 'para', 'com', 'não', 'à', 'ao', 'os', 'as', 'se', 'na', 'no', 'é', 'ser', 'seu', 'sua', 'isso', 'the', 'of', 'and', 'in', 'to', 'is', 'for', 'on', 'with', 'at', 'by', 'from'}
|
||||
|
||||
with open('context/basedosdados-schema.json') as f:
|
||||
schema = json.load(f)
|
||||
|
||||
words = []
|
||||
for dataset, tables in schema.items():
|
||||
for table, cols in tables.items():
|
||||
for col in cols:
|
||||
name = col.get('name', '').lower()
|
||||
desc = col.get('description', '').lower()
|
||||
if name and len(name) >= 3:
|
||||
words.append(name)
|
||||
if desc:
|
||||
for w in desc.split():
|
||||
w = re.sub(r'[^a-záàâãéèêíìîóòôõúùûç]', '', w)
|
||||
if len(w) >= 3 and w not in STOPWORDS:
|
||||
words.append(w)
|
||||
|
||||
word_freq = Counter(words)
|
||||
|
||||
wc = WordCloud(
|
||||
width=1600,
|
||||
height=800,
|
||||
background_color='white',
|
||||
max_words=200,
|
||||
colormap='viridis',
|
||||
min_font_size=8
|
||||
).generate_from_frequencies(word_freq)
|
||||
|
||||
plt.figure(figsize=(20, 10))
|
||||
plt.imshow(wc, interpolation='bilinear')
|
||||
plt.axis('off')
|
||||
plt.tight_layout(pad=0)
|
||||
plt.savefig('docs/wordcloud_attributes.png', dpi=150, bbox_inches='tight')
|
||||
print("Saved docs/wordcloud_attributes.png")
|
||||
print(f"Total unique words: {len(word_freq)}")
|
||||
print("Top 30:", word_freq.most_common(30))
|
||||
BIN
docs/wordcloud_datasets.png
Normal file
BIN
docs/wordcloud_datasets.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 1.3 MiB |
33
docs/wordcloud_datasets.py
Normal file
33
docs/wordcloud_datasets.py
Normal file
@@ -0,0 +1,33 @@
|
||||
#!/usr/bin/env python3
|
||||
import json
|
||||
from collections import Counter
|
||||
from wordcloud import WordCloud
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
with open('context/basedosdados-schema.json') as f:
|
||||
schema = json.load(f)
|
||||
|
||||
dataset_names = []
|
||||
for dataset in schema.keys():
|
||||
parts = dataset.replace('br_', '').replace('mundo_', '').replace('eu_', '').split('_')
|
||||
dataset_names.extend([p for p in parts if len(p) >= 3])
|
||||
|
||||
word_freq = Counter(dataset_names)
|
||||
|
||||
wc = WordCloud(
|
||||
width=1600,
|
||||
height=800,
|
||||
background_color='white',
|
||||
max_words=100,
|
||||
colormap='plasma',
|
||||
min_font_size=10
|
||||
).generate_from_frequencies(word_freq)
|
||||
|
||||
plt.figure(figsize=(20, 10))
|
||||
plt.imshow(wc, interpolation='bilinear')
|
||||
plt.axis('off')
|
||||
plt.tight_layout(pad=0)
|
||||
plt.savefig('docs/wordcloud_datasets.png', dpi=150, bbox_inches='tight')
|
||||
print("Saved docs/wordcloud_datasets.png")
|
||||
print(f"Total unique words: {len(word_freq)}")
|
||||
print("Top 30:", word_freq.most_common(30))
|
||||
268
gera_schemas.py
268
gera_schemas.py
@@ -1,268 +0,0 @@
|
||||
import os
|
||||
import json
|
||||
import sys
|
||||
import pyarrow.parquet as pq
|
||||
import s3fs
|
||||
import boto3
|
||||
import duckdb
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
S3_ENDPOINT = os.environ["HETZNER_S3_ENDPOINT"]
|
||||
S3_BUCKET = os.environ["HETZNER_S3_BUCKET"]
|
||||
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
|
||||
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
|
||||
|
||||
s3_host = S3_ENDPOINT.removeprefix("https://").removeprefix("http://")
|
||||
|
||||
# --- boto3 client (listing only, zero egress) ---
|
||||
boto = boto3.client(
|
||||
"s3",
|
||||
endpoint_url=S3_ENDPOINT,
|
||||
aws_access_key_id=ACCESS_KEY,
|
||||
aws_secret_access_key=SECRET_KEY,
|
||||
)
|
||||
|
||||
# --- s3fs filesystem (footer-only reads via pyarrow) ---
|
||||
fs = s3fs.S3FileSystem(
|
||||
client_kwargs={"endpoint_url": S3_ENDPOINT},
|
||||
key=ACCESS_KEY,
|
||||
secret=SECRET_KEY,
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 1: File inventory via S3 List API (zero data egress)
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 1: listing S3 objects...")
|
||||
paginator = boto.get_paginator("list_objects_v2")
|
||||
|
||||
inventory = {} # "dataset/table" -> {files: [...], total_size: int}
|
||||
|
||||
for page in paginator.paginate(Bucket=S3_BUCKET):
|
||||
for obj in page.get("Contents", []):
|
||||
key = obj["Key"]
|
||||
if not key.endswith(".parquet"):
|
||||
continue
|
||||
parts = key.split("/")
|
||||
if len(parts) < 3:
|
||||
continue
|
||||
dataset, table = parts[0], parts[1]
|
||||
dt = f"{dataset}/{table}"
|
||||
if dt not in inventory:
|
||||
inventory[dt] = {"files": [], "total_size_bytes": 0}
|
||||
inventory[dt]["files"].append(key)
|
||||
inventory[dt]["total_size_bytes"] += obj["Size"]
|
||||
|
||||
print(f" Found {len(inventory)} tables across {S3_BUCKET}")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 2: Schema reads — footer only (~30 KB per table)
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 2: reading parquet footers...")
|
||||
|
||||
def fmt_size(b):
|
||||
for unit in ("B", "KB", "MB", "GB", "TB"):
|
||||
if b < 1024 or unit == "TB":
|
||||
return f"{b:.1f} {unit}"
|
||||
b /= 1024
|
||||
|
||||
def extract_col_descriptions(schema):
|
||||
"""Try to pull per-column descriptions from Arrow metadata."""
|
||||
descriptions = {}
|
||||
meta = schema.metadata or {}
|
||||
# BigQuery exports embed a JSON blob under b'pandas' with column_info
|
||||
pandas_meta_raw = meta.get(b"pandas") or meta.get(b"pandas_metadata")
|
||||
if pandas_meta_raw:
|
||||
try:
|
||||
pm = json.loads(pandas_meta_raw)
|
||||
for col in pm.get("columns", []):
|
||||
name = col.get("name")
|
||||
desc = col.get("metadata", {}) or {}
|
||||
if isinstance(desc, dict) and "description" in desc:
|
||||
descriptions[name] = desc["description"]
|
||||
except Exception:
|
||||
pass
|
||||
# Also try top-level b'description' or b'schema'
|
||||
for key in (b"description", b"schema", b"BigQuery:description"):
|
||||
val = meta.get(key)
|
||||
if val:
|
||||
try:
|
||||
descriptions["__table__"] = val.decode("utf-8", errors="replace")
|
||||
except Exception:
|
||||
pass
|
||||
return descriptions
|
||||
|
||||
schemas = {}
|
||||
errors = []
|
||||
|
||||
for i, (dt, info) in enumerate(sorted(inventory.items())):
|
||||
dataset, table = dt.split("/", 1)
|
||||
first_file = info["files"][0]
|
||||
s3_path = f"{S3_BUCKET}/{first_file}"
|
||||
try:
|
||||
schema = pq.read_schema(fs.open(s3_path))
|
||||
col_descs = extract_col_descriptions(schema)
|
||||
|
||||
# Build raw metadata dict (decode bytes keys/values)
|
||||
raw_meta = {}
|
||||
if schema.metadata:
|
||||
for k, v in schema.metadata.items():
|
||||
try:
|
||||
dk = k.decode("utf-8", errors="replace")
|
||||
dv = v.decode("utf-8", errors="replace")
|
||||
# Try to parse JSON values
|
||||
try:
|
||||
dv = json.loads(dv)
|
||||
except Exception:
|
||||
pass
|
||||
raw_meta[dk] = dv
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
columns = []
|
||||
for field in schema:
|
||||
col = {
|
||||
"name": field.name,
|
||||
"type": str(field.type),
|
||||
"nullable": field.nullable,
|
||||
}
|
||||
if field.name in col_descs:
|
||||
col["description"] = col_descs[field.name]
|
||||
# Check field-level metadata
|
||||
if field.metadata:
|
||||
for k, v in field.metadata.items():
|
||||
try:
|
||||
dk = k.decode("utf-8", errors="replace")
|
||||
dv = v.decode("utf-8", errors="replace")
|
||||
if dk in ("description", "DESCRIPTION", "comment"):
|
||||
col["description"] = dv
|
||||
except Exception:
|
||||
pass
|
||||
columns.append(col)
|
||||
|
||||
schemas[f"{dataset}.{table}"] = {
|
||||
"path": f"s3://{S3_BUCKET}/{dataset}/{table}/",
|
||||
"file_count": len(info["files"]),
|
||||
"total_size_bytes": info["total_size_bytes"],
|
||||
"total_size_human": fmt_size(info["total_size_bytes"]),
|
||||
"columns": columns,
|
||||
"metadata": raw_meta,
|
||||
}
|
||||
print(f" [{i+1}/{len(inventory)}] ✓ {dataset}.{table} ({len(columns)} cols, {fmt_size(info['total_size_bytes'])})")
|
||||
except Exception as e:
|
||||
errors.append({"table": f"{dataset}.{table}", "error": str(e)})
|
||||
print(f" [{i+1}/{len(inventory)}] ✗ {dataset}.{table}: {e}", file=sys.stderr)
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 3: Enrich from br_bd_metadados.bigquery_tables (small table)
|
||||
# ------------------------------------------------------------------ #
|
||||
META_TABLE = "br_bd_metadados.bigquery_tables"
|
||||
meta_dt = "br_bd_metadados/bigquery_tables"
|
||||
|
||||
if meta_dt in inventory:
|
||||
print(f"Phase 3: enriching from {META_TABLE}...")
|
||||
try:
|
||||
con = duckdb.connect()
|
||||
con.execute("INSTALL httpfs; LOAD httpfs;")
|
||||
con.execute(f"""
|
||||
SET s3_endpoint='{s3_host}';
|
||||
SET s3_access_key_id='{ACCESS_KEY}';
|
||||
SET s3_secret_access_key='{SECRET_KEY}';
|
||||
SET s3_url_style='path';
|
||||
""")
|
||||
meta_path = f"s3://{S3_BUCKET}/br_bd_metadados/bigquery_tables/*.parquet"
|
||||
# Peek at available columns
|
||||
available = [r[0] for r in con.execute(f"DESCRIBE SELECT * FROM '{meta_path}' LIMIT 1").fetchall()]
|
||||
print(f" Metadata columns: {available}")
|
||||
|
||||
# Try to find dataset/table description columns
|
||||
desc_col = next((c for c in available if "description" in c.lower()), None)
|
||||
ds_col = next((c for c in available if c.lower() in ("dataset_id", "dataset", "schema_name")), None)
|
||||
tbl_col = next((c for c in available if c.lower() in ("table_id", "table_name", "table")), None)
|
||||
|
||||
if desc_col and ds_col and tbl_col:
|
||||
rows = con.execute(f"""
|
||||
SELECT {ds_col}, {tbl_col}, {desc_col}
|
||||
FROM '{meta_path}'
|
||||
""").fetchall()
|
||||
for ds, tbl, desc in rows:
|
||||
key = f"{ds}.{tbl}"
|
||||
if key in schemas and desc:
|
||||
schemas[key]["table_description"] = desc
|
||||
print(f" Enriched {len(rows)} table descriptions")
|
||||
else:
|
||||
print(f" Could not find expected columns (dataset_id, table_id, description) — skipping enrichment")
|
||||
con.close()
|
||||
except Exception as e:
|
||||
print(f" Enrichment failed: {e}", file=sys.stderr)
|
||||
else:
|
||||
print("Phase 3: br_bd_metadados.bigquery_tables not in S3 — skipping enrichment")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 4a: Write schemas.json
|
||||
# ------------------------------------------------------------------ #
|
||||
print("Phase 4: writing outputs...")
|
||||
|
||||
output = {
|
||||
"_meta": {
|
||||
"bucket": S3_BUCKET,
|
||||
"total_tables": len(schemas),
|
||||
"total_size_bytes": sum(v["total_size_bytes"] for v in schemas.values()),
|
||||
"total_size_human": fmt_size(sum(v["total_size_bytes"] for v in schemas.values())),
|
||||
"errors": errors,
|
||||
},
|
||||
"tables": dict(sorted(schemas.items())),
|
||||
}
|
||||
|
||||
with open("schemas.json", "w", encoding="utf-8") as f:
|
||||
json.dump(output, f, ensure_ascii=False, indent=2)
|
||||
|
||||
print(f" ✓ schemas.json ({len(schemas)} tables)")
|
||||
|
||||
# ------------------------------------------------------------------ #
|
||||
# Phase 4b: Write file_tree.md
|
||||
# ------------------------------------------------------------------ #
|
||||
lines = [
|
||||
f"# S3 File Tree: {S3_BUCKET}",
|
||||
"",
|
||||
]
|
||||
|
||||
# Group by dataset
|
||||
datasets_map = {}
|
||||
for dt_key, info in sorted(inventory.items()):
|
||||
dataset, table = dt_key.split("/", 1)
|
||||
datasets_map.setdefault(dataset, []).append((table, info))
|
||||
|
||||
total_files = sum(len(v["files"]) for v in inventory.values())
|
||||
total_bytes = sum(v["total_size_bytes"] for v in inventory.values())
|
||||
|
||||
for dataset, tables in sorted(datasets_map.items()):
|
||||
ds_bytes = sum(i["total_size_bytes"] for _, i in tables)
|
||||
ds_files = sum(len(i["files"]) for _, i in tables)
|
||||
lines.append(f"## {dataset}/ ({len(tables)} tables, {fmt_size(ds_bytes)}, {ds_files} files)")
|
||||
lines.append("")
|
||||
for table, info in sorted(tables):
|
||||
schema_entry = schemas.get(f"{dataset}.{table}", {})
|
||||
ncols = len(schema_entry.get("columns", []))
|
||||
col_str = f", {ncols} cols" if ncols else ""
|
||||
table_desc = schema_entry.get("table_description", "")
|
||||
desc_str = f" — {table_desc}" if table_desc else ""
|
||||
lines.append(f" - **{table}/** ({len(info['files'])} files, {fmt_size(info['total_size_bytes'])}{col_str}){desc_str}")
|
||||
lines.append("")
|
||||
|
||||
lines += [
|
||||
"---",
|
||||
f"**Total: {len(inventory)} tables · {fmt_size(total_bytes)} · {total_files} parquet files**",
|
||||
]
|
||||
|
||||
with open("file_tree.md", "w", encoding="utf-8") as f:
|
||||
f.write("\n".join(lines) + "\n")
|
||||
|
||||
print(f" ✓ file_tree.md ({len(inventory)} tables)")
|
||||
print()
|
||||
print("Done!")
|
||||
print(f" schemas.json — full column-level schema dump")
|
||||
print(f" file_tree.md — bucket tree with sizes")
|
||||
if errors:
|
||||
print(f" {len(errors)} tables failed (see schemas.json _meta.errors)")
|
||||
@@ -1,4 +0,0 @@
|
||||
duckdb
|
||||
boto3
|
||||
python-dotenv
|
||||
openai
|
||||
42
scripts/build_ask.sh
Executable file
42
scripts/build_ask.sh
Executable file
@@ -0,0 +1,42 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
cd "$(dirname "$0")"
|
||||
|
||||
echo "=== Building ask binary for Linux x86_64 ==="
|
||||
echo "Using Debian x86_64 container for native build..."
|
||||
|
||||
# Build in an x86_64 Debian container - this gives us a real x86_64 environment
|
||||
# so we can build natively without cross-compilation complexity
|
||||
# Use ask/ as context to avoid .dockerignore excluding src/
|
||||
docker build \
|
||||
--platform linux/amd64 \
|
||||
-t ask-builder \
|
||||
--build-arg BUILDKIT_INLINE_CACHE=1 \
|
||||
-f - ask/ <<'EOF'
|
||||
FROM rust:1.85-slim
|
||||
|
||||
RUN apt-get update -qq && \
|
||||
apt-get install -y --no-install-recommends \
|
||||
build-essential pkg-config libssl-dev && \
|
||||
apt-get clean && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
WORKDIR /build
|
||||
|
||||
COPY . ./
|
||||
RUN cargo build --release --locked
|
||||
|
||||
FROM scratch
|
||||
COPY --from=0 /build/target/release/ask /ask
|
||||
EOF
|
||||
|
||||
echo "=== Extracting binary ==="
|
||||
# Extract the binary from the image. The final stage is FROM scratch (no shell,
# no cat), so create a container and copy the file out instead of running it.
mkdir -p ./ask/target/release
container_id=$(docker create --platform linux/amd64 ask-builder /ask)
docker cp "$container_id":/ask ./ask/target/release/ask
docker rm "$container_id" >/dev/null
|
||||
|
||||
# Make it executable
|
||||
chmod +x ./ask/target/release/ask
|
||||
|
||||
echo "=== Binary built successfully ==="
|
||||
file ./ask/target/release/ask
|
||||
ls -lh ./ask/target/release/ask
|
||||
@@ -62,7 +62,8 @@ if $GCLOUD_RUN; then
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
else
|
||||
elif ! $SYNC_RUN; then
|
||||
# Only require heavy GCP tools for the main export (not for --sync)
|
||||
for cmd in bq gcloud gsutil parallel rclone flock; do
|
||||
if ! command -v "$cmd" &>/dev/null; then
|
||||
log_err "'$cmd' not found. Install google-cloud-sdk, GNU parallel, and rclone."
|
||||
@@ -164,8 +165,8 @@ echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.clou
|
||||
| sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list >/dev/null
|
||||
sudo apt-get update -qq
|
||||
sudo apt-get install -y google-cloud-cli
|
||||
chmod +x ~/roda.sh
|
||||
echo "Dependencies installed."
|
||||
chmod +x ~/roda.sh
|
||||
echo "Dependencies installed."
|
||||
REMOTE_SETUP
|
||||
log " Dependencies ready."
|
||||
|
||||
@@ -197,6 +198,121 @@ REMOTE_SETUP
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# VM EXPORT — use existing bd-export-vm to export specific tables to GCS → S3
|
||||
# =============================================================================
|
||||
if [[ "${1:-}" == "--vm-export" ]]; then
|
||||
VM_NAME="${GCP_VM_NAME:-bd-export-vm}"
|
||||
VM_ZONE="${GCP_VM_ZONE:-us-central1-a}"
|
||||
VM_PROJECT="${GCP_VM_PROJECT:-raspa-491716}"
|
||||
TABLE_LIST="${2:-missing_tables.txt}"
|
||||
|
||||
log "=============================="
|
||||
log " VM EXPORT MODE"
|
||||
log " VM: $VM_NAME ($VM_ZONE)"
|
||||
log " Tables: $TABLE_LIST"
|
||||
log "=============================="
|
||||
|
||||
if [[ ! -f "$TABLE_LIST" ]]; then
|
||||
log_err "Table list not found: $TABLE_LIST"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
log "[1/5] Syncing files to VM..."
|
||||
gcloud compute scp \
|
||||
"$(dirname "$0")/roda.sh" \
|
||||
"$(dirname "$0")/.env" \
|
||||
"$(realpath "$TABLE_LIST")" \
|
||||
"$VM_NAME:~/" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT"
|
||||
|
||||
log "[2/5] Ensuring GCS bucket exists..."
|
||||
if ! gsutil ls "gs://$BUCKET_NAME" &>/dev/null; then
|
||||
gsutil mb -p "$VM_PROJECT" -l "$BUCKET_REGION" -b on "gs://$BUCKET_NAME"
|
||||
log " Bucket created: gs://$BUCKET_NAME"
|
||||
else
|
||||
log " Bucket already exists."
|
||||
fi
|
||||
|
||||
log "[3/5] Running export on VM (bq extract + rclone)..."
|
||||
gcloud compute ssh "$VM_NAME" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT" \
|
||||
--command="bash -s" <<'REMOTE_EXPORT'
|
||||
set -euo pipefail
|
||||
export DEBIAN_FRONTEND=noninteractive
|
||||
cd ~
|
||||
set -a
|
||||
# shellcheck source=.env
|
||||
source .env
|
||||
set +a
|
||||
source ~/.bashrc 2>/dev/null || true
|
||||
|
||||
export RCLONE_CONFIG_BD_TYPE="google cloud storage"
|
||||
export RCLONE_CONFIG_BD_BUCKET_POLICY_ONLY="true"
|
||||
export RCLONE_CONFIG_HZ_TYPE="s3"
|
||||
export RCLONE_CONFIG_HZ_PROVIDER="Other"
|
||||
export RCLONE_CONFIG_HZ_ENDPOINT="$HETZNER_S3_ENDPOINT"
|
||||
export RCLONE_CONFIG_HZ_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID"
|
||||
export RCLONE_CONFIG_HZ_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"
|
||||
|
||||
echo "[BQ EXTRACT] Starting export of missing tables..."
|
||||
|
||||
extract_table() {
|
||||
local table="$1"
|
||||
local dataset table_id gcs_prefix
|
||||
dataset=$(echo "$table" | cut -d. -f1)
|
||||
table_id=$(echo "$table" | cut -d. -f2)
|
||||
gcs_prefix="gs://$BUCKET_NAME/$dataset/$table_id"
|
||||
|
||||
echo "[EXTRACT] $table"
|
||||
bq extract \
|
||||
--project_id="$YOUR_PROJECT" \
|
||||
--destination_format=PARQUET \
|
||||
--compression=ZSTD \
|
||||
--location=US \
|
||||
"${SOURCE_PROJECT}:${dataset}.${table_id}" \
|
||||
"${gcs_prefix}/*.parquet" 2>&1 \
|
||||
|| echo "[FAIL] $table"
|
||||
}
|
||||
|
||||
export -f extract_table
|
||||
export BUCKET_NAME SOURCE_PROJECT
|
||||
|
||||
cat missing_tables.txt | parallel -j8 --bar extract_table {}
|
||||
|
||||
echo "[TRANSFER] GCS → Hetzner S3..."
|
||||
datasets=$(gsutil ls "gs://$BUCKET_NAME/" 2>/dev/null | sed 's|gs://[^/]*/||;s|/$||' | grep -v '^$' | sort -u)
|
||||
for ds in $datasets; do
|
||||
echo "[TRANSFER] $ds"
|
||||
rclone copy "bd:$BUCKET_NAME/$ds/" "hz:$HETZNER_S3_BUCKET/$ds/" \
|
||||
--transfers 32 --s3-upload-concurrency 32 --progress 2>&1 \
|
||||
|| echo "[FAIL_TRANSFER] $ds"
|
||||
done
|
||||
|
||||
echo "[DONE] Export complete."
|
||||
REMOTE_EXPORT
|
||||
|
||||
log "[4/5] Verifying transfer..."
|
||||
S3_COUNT=$(gcloud compute ssh "$VM_NAME" \
|
||||
--zone="$VM_ZONE" \
|
||||
--project="$VM_PROJECT" \
|
||||
--command="source .env && rclone ls hz:\$HETZNER_S3_BUCKET 2>/dev/null | grep -c '\.parquet\$' || echo 0" 2>/dev/null)
|
||||
log " S3 parquet files: $S3_COUNT"
|
||||
|
||||
log "[5/5] Cleaning up GCS bucket..."
|
||||
read -rp "Delete GCS bucket gs://$BUCKET_NAME? [y/N] " confirm
|
||||
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
||||
gsutil -m rm -r "gs://$BUCKET_NAME"
|
||||
gsutil rb "gs://$BUCKET_NAME"
|
||||
log " Bucket deleted."
|
||||
fi
|
||||
|
||||
log "VM export complete."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# =============================================================================
|
||||
# SYNC — BigQuery → S3 direct (no GCS intermediary)
|
||||
# =============================================================================
|
||||
6
start.sh
6
start.sh
@@ -19,13 +19,13 @@ SQL
|
||||
chmod 600 /app/ssh_init.sql
|
||||
|
||||
echo "[start] Starting ttyd terminal (db)..."
|
||||
ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/basedosdados.duckdb &
|
||||
ttyd --port 7681 --writable duckdb -readonly --init /app/ssh_init.sql /app/data/basedosdados.duckdb &
|
||||
|
||||
echo "[start] Starting ttyd terminal (ask)..."
|
||||
ttyd --port 7682 --writable python3 /app/ask.py &
|
||||
ttyd --port 7682 --writable /app/ask &
|
||||
|
||||
echo "[start] Starting auth service..."
|
||||
python3 /app/auth.py &
|
||||
python3 /app/shell/auth.py &
|
||||
|
||||
echo "[start] Starting Caddy..."
|
||||
exec caddy run --config /app/Caddyfile --adapter caddyfile
|
||||
|
||||
@@ -1,543 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
sync_bq_to_local.py
|
||||
|
||||
Syncs missing tables from BigQuery (basedosdados project) to Hetzner S3,
|
||||
then registers them as DuckDB views.
|
||||
|
||||
Usage:
|
||||
python3 sync_bq_to_local.py # full sync
|
||||
python3 sync_bq_to_local.py --dry-run # list missing tables only
|
||||
python3 sync_bq_to_local.py --resume # resume from last run
|
||||
|
||||
Prerequisites:
|
||||
gcloud auth application-default login
|
||||
GCP project with billing enabled (free tier: 1 TB/month)
|
||||
|
||||
Environment (.env):
|
||||
GCP_PROJECT - GCP project ID for billing
|
||||
HETZNER_S3_BUCKET - S3 bucket name
|
||||
HETZNER_S3_ENDPOINT - S3 endpoint URL
|
||||
AWS_ACCESS_KEY_ID - S3 access key
|
||||
AWS_SECRET_ACCESS_KEY - S3 secret key
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import argparse
|
||||
import logging
|
||||
import subprocess
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
import boto3
|
||||
from botocore.config import Config as BotoConfig
|
||||
from google.cloud import bigquery
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Logging
|
||||
# ---------------------------------------------------------------------------
|
||||
LOG_FILE = f"sync_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s %(levelname)s %(message)s",
|
||||
handlers=[
|
||||
logging.FileHandler(LOG_FILE),
|
||||
logging.StreamHandler(sys.stdout),
|
||||
],
|
||||
)
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
SOURCE_PROJECT = "basedosdados"
|
||||
MISSING_TABLES_FILE = "tasks/datasets_to_scrap.md"
|
||||
DONE_FILE = "done_sync.txt"
|
||||
FAILED_FILE = "failed_sync.txt"
|
||||
DATA_DIR = "data"
|
||||
PARQUET_DIR = "parquet"
|
||||
MAX_RETRIES = 3
|
||||
BATCH_SIZE = 1 # export one table at a time to manage memory
|
||||
WORKERS = 4 # parallel uploads
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def load_env():
|
||||
"""Load required environment variables."""
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv()
|
||||
|
||||
required = [
|
||||
"GCP_PROJECT",
|
||||
"HETZNER_S3_BUCKET",
|
||||
"HETZNER_S3_ENDPOINT",
|
||||
"AWS_ACCESS_KEY_ID",
|
||||
"AWS_SECRET_ACCESS_KEY",
|
||||
]
|
||||
missing = [v for v in required if not os.environ.get(v)]
|
||||
if missing:
|
||||
log.error("Missing env vars: %s", missing)
|
||||
sys.exit(1)
|
||||
|
||||
return {v: os.environ[v] for v in required}
|
||||
|
||||
|
||||
def get_s3_client(env):
|
||||
"""Create boto3 S3 client configured for Hetzner."""
|
||||
return boto3.client(
|
||||
"s3",
|
||||
endpoint_url=env["HETZNER_S3_ENDPOINT"],
|
||||
aws_access_key_id=env["AWS_ACCESS_KEY_ID"],
|
||||
aws_secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
|
||||
config=BotoConfig(s3={"addressing_style": "path"}),
|
||||
)
|
||||
|
||||
|
||||
def get_bq_client():
|
||||
"""Create BigQuery client using Application Default Credentials."""
|
||||
try:
|
||||
os.environ["GOOGLE_CLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
|
||||
os.environ["GCLOUD_PROJECT"] = os.environ.get("GCP_PROJECT", "")
|
||||
client = bigquery.Client(project=os.environ.get("GCP_PROJECT", ""))
|
||||
# Test the connection
|
||||
list(client.list_datasets(max_results=1))
|
||||
return client
|
||||
except Exception as e:
|
||||
log.error("BigQuery auth failed: %s", e)
|
||||
log.error("")
|
||||
log.error("Run these commands to authenticate:")
|
||||
log.error(" gcloud auth login")
|
||||
log.error(" gcloud auth application-default login")
|
||||
log.error(" gcloud config set project %s", os.environ.get("GCP_PROJECT", ""))
|
||||
log.error("")
|
||||
log.error("The free tier (1 TB/month) is sufficient — no credit card needed.")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def list_bq_tables(bq_client):
|
||||
"""List all tables in the basedosdados BigQuery project."""
|
||||
log.info("Discovering tables in BigQuery project: %s", SOURCE_PROJECT)
|
||||
tables = {}
|
||||
|
||||
try:
|
||||
datasets = list(bq_client.list_datasets())
|
||||
log.info("Found %d datasets", len(datasets))
|
||||
except Exception as e:
|
||||
log.error("Failed to list datasets: %s", e)
|
||||
sys.exit(1)
|
||||
|
||||
for dataset in datasets:
|
||||
try:
|
||||
tables_list = list(
|
||||
bq_client.list_tables(
|
||||
f"{SOURCE_PROJECT}.{dataset.dataset_id}",
|
||||
max_results=10000,
|
||||
)
|
||||
)
|
||||
for t in tables_list:
|
||||
tables[f"{dataset.dataset_id}.{t.table_id}"] = {
|
||||
"dataset": dataset.dataset_id,
|
||||
"table": t.table_id,
|
||||
"full_id": f"{SOURCE_PROJECT}.{dataset.dataset_id}.{t.table_id}",
|
||||
"schema": [f.name for f in t.schema] if t.schema else [],
|
||||
"num_bytes": t.num_bytes,
|
||||
"num_rows": t.num_rows,
|
||||
}
|
||||
except Exception as e:
|
||||
log.warning("Failed to list tables in dataset %s: %s", dataset.dataset_id, e)
|
||||
|
||||
log.info("Total BigQuery tables discovered: %d", len(tables))
|
||||
return tables
|
||||
|
||||
|
||||
def list_s3_tables(s3_client, bucket):
|
||||
"""List datasets/tables already exported to S3."""
|
||||
log.info("Discovering tables already in S3 bucket: %s", bucket)
|
||||
table_files = defaultdict(lambda: defaultdict(list))
|
||||
|
||||
try:
|
||||
paginator = s3_client.get_paginator("list_objects_v2")
|
||||
for page in paginator.paginate(Bucket=bucket):
|
||||
for obj in page.get("Contents", []):
|
||||
key = obj["Key"]
|
||||
if not key.endswith(".parquet"):
|
||||
continue
|
||||
parts = key.split("/")
|
||||
if len(parts) >= 3:
|
||||
dataset, table = parts[0], parts[1]
|
||||
table_files[dataset][table].append(key)
|
||||
except Exception as e:
|
||||
log.warning("S3 listing error (may be empty bucket): %s", e)
|
||||
|
||||
tables = {}
|
||||
for dataset, t_dict in table_files.items():
|
||||
for table, files in t_dict.items():
|
||||
tables[f"{dataset}.{table}"] = files
|
||||
|
||||
log.info("Total S3 tables discovered: %d", len(tables))
|
||||
return tables
|
||||
|
||||
|
||||
def parse_missing_tables_from_md(filepath):
|
||||
"""Parse the missing tables from tasks/datasets_to_scrap.md.
|
||||
|
||||
Returns a dict mapping 'dataset.table' -> description.
|
||||
Falls back to None (use all non-S3 tables) if file not found.
|
||||
"""
|
||||
if not os.path.exists(filepath):
|
||||
log.warning("Missing file %s, using all non-S3 tables", filepath)
|
||||
return None
|
||||
|
||||
log.info("Parsing missing tables from %s", filepath)
|
||||
with open(filepath) as f:
|
||||
content = f.read()
|
||||
|
||||
missing = {}
|
||||
lines = content.split("\n")
|
||||
i = 0
|
||||
|
||||
def next_nonempty(lines, i):
|
||||
while i < len(lines) and not lines[i].strip():
|
||||
i += 1
|
||||
return i
|
||||
|
||||
while i < len(lines):
|
||||
line = lines[i].strip()
|
||||
|
||||
# Find the Basedosdados.org section
|
||||
if "Basedosdados.org" in line and "Not in basedosdados.duckdb" in line:
|
||||
log.info("Found Basedosdados.org section at line %d", i + 1)
|
||||
i += 1
|
||||
break
|
||||
i += 1
|
||||
|
||||
# Now parse table entries
|
||||
while i < len(lines):
|
||||
line = lines[i].strip()
|
||||
|
||||
# End of section only on top-level ## headers, not ### subsections
|
||||
if line.startswith("## "):
|
||||
break
|
||||
|
||||
# Skip separators and empty lines
|
||||
if not line or line.startswith("---") or "|---" in line:
|
||||
i += 1
|
||||
continue
|
||||
|
||||
# Find rows with backtick-wrapped dataset names (e.g. | `br_abrinq_oca` | ...)
|
||||
if "`" in line and "|" in line:
|
||||
# Split by pipe, strip whitespace and backticks
|
||||
parts = [p.strip().strip("`").strip() for p in line.split("|")]
|
||||
# Filter empty parts
|
||||
parts = [p for p in parts if p]
|
||||
|
||||
if len(parts) >= 2:
|
||||
dataset_raw = parts[0]
|
||||
# Check if it looks like a dataset name (br_*, eu_*, mundo_*, etc.)
|
||||
is_dataset = any(
|
||||
dataset_raw.startswith(prefix)
|
||||
for prefix in ("br_", "eu_", "mundo_", "nl_", "world_")
|
||||
)
|
||||
|
||||
if is_dataset:
|
||||
# parts[1] contains the missing table names (comma-separated)
|
||||
tables_raw = parts[1]
|
||||
for tbl in tables_raw.split(","):
|
||||
tbl = tbl.strip()
|
||||
# Clean up: remove parenthetical notes, trailing text
|
||||
if "(" in tbl:
|
||||
tbl = tbl.split("(")[0].strip()
|
||||
if tbl and not tbl.startswith("-"):
|
||||
missing[f"{dataset_raw}.{tbl}"] = f"from {filepath}"
|
||||
|
||||
i += 1
|
||||
|
||||
log.info("Parsed %d missing table references from MD", len(missing))
|
||||
return missing if missing else None
|
||||
|
||||
|
||||
def compute_missing_tables(bq_tables, s3_tables, md_missing):
|
||||
"""Compute which tables need to be synced."""
|
||||
if md_missing is None:
|
||||
log.info("No MD file, computing diff: BQ - S3")
|
||||
return [
|
||||
(table_id, info)
|
||||
for table_id, info in bq_tables.items()
|
||||
if table_id not in s3_tables
|
||||
]
|
||||
|
||||
log.info("Computing sync targets: MD missing tables not in S3")
|
||||
targets = []
|
||||
for key, info in bq_tables.items():
|
||||
if key in s3_tables:
|
||||
continue
|
||||
if key in md_missing:
|
||||
targets.append((key, info))
|
||||
else:
|
||||
# Table not in S3 but not in MD missing list
|
||||
# Check if its dataset is partially covered
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
# If any table from this dataset is in MD missing, include it
|
||||
dataset_in_md = any(
|
||||
k.startswith(f"{dataset}.") and k.split(".", 1)[1] in md_missing
|
||||
for k in bq_tables
|
||||
)
|
||||
if not dataset_in_md:
|
||||
targets.append((key, info))
|
||||
|
||||
return targets
|
||||
|
||||
|
||||
def estimate_size_mb(num_bytes):
|
||||
"""Estimate size in MB."""
|
||||
if num_bytes is None:
|
||||
return "?"
|
||||
return f"{num_bytes / 1_048_576:.1f}"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Export logic
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def sync_table(args, table_id, info, dry_run=False):
|
||||
"""Sync a single table: BQ → parquet → S3 → DuckDB view."""
|
||||
bq_client, s3_client, bucket = args
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
full_id = info["full_id"]
|
||||
|
||||
s3_key_prefix = f"{dataset}/{table}"
|
||||
|
||||
if dry_run:
|
||||
size_mb = estimate_size_mb(info.get("num_bytes"))
|
||||
return True, f"[DRY] {dataset}.{table} (~{size_mb} MB)"
|
||||
|
||||
# Step 1: Query from BigQuery
|
||||
log.info("Querying %s from BigQuery", full_id)
|
||||
query = f"SELECT * FROM `{full_id}`"
|
||||
|
||||
try:
|
||||
query_job = bq_client.query(query, location="US")
|
||||
df = query_job.to_dataframe()
|
||||
except Exception as e:
|
||||
return False, f"BQ query failed for {table_id}: {e}"
|
||||
|
||||
if df.empty:
|
||||
return True, f"[SKIP] {table_id} — empty table"
|
||||
|
||||
if df.shape[0] > 10_000_000:
|
||||
log.warning("Table %s has %d rows — may be slow/memory-intensive", table_id, df.shape[0])
|
||||
|
||||
# Step 2: Write to parquet in memory, then upload
|
||||
import io
|
||||
import pyarrow as pa
|
||||
import pyarrow.parquet as pq
|
||||
|
||||
buffer = io.BytesIO()
|
||||
table_pa = pa.Table.from_pandas(df)
|
||||
|
||||
# Write with zstd compression
|
||||
writer = pq.ParquetWriter(
|
||||
buffer,
|
||||
table_pa.schema,
|
||||
compression="zstd",
|
||||
use_dictionary=True,
|
||||
)
|
||||
writer.write_table(table_pa)
|
||||
writer.close()
|
||||
buffer.seek(0)
|
||||
|
||||
s3_key = f"{s3_key_prefix}/{table}.parquet"
|
||||
log.info("Uploading %s → s3://%s/%s (%s, %d rows)",
|
||||
table_id, bucket, s3_key,
|
||||
f"{buffer.getbuffer().nbytes / 1_048_576:.1f} MB",
|
||||
df.shape[0])
|
||||
|
||||
try:
|
||||
s3_client.upload_fileobj(
|
||||
buffer,
|
||||
bucket,
|
||||
s3_key,
|
||||
ExtraArgs={"ContentType": "application/octet-stream"},
|
||||
)
|
||||
except Exception as e:
|
||||
return False, f"S3 upload failed for {table_id}: {e}"
|
||||
|
||||
log.info("[DONE] %s uploaded to s3://%s/%s", table_id, bucket, s3_key)
|
||||
return True, f"[DONE] {table_id}"
|
||||
|
||||
|
||||
def update_duckdb_view(env, table_id, info):
|
||||
"""Register a new table as a DuckDB view over S3 parquet."""
|
||||
import duckdb
|
||||
|
||||
dataset = info["dataset"]
|
||||
table = info["table"]
|
||||
bucket = env["HETZNER_S3_BUCKET"]
|
||||
endpoint = env["HETZNER_S3_ENDPOINT"].removeprefix("https://").removeprefix("http://")
|
||||
access_key = env["AWS_ACCESS_KEY_ID"]
|
||||
secret_key = env["AWS_SECRET_ACCESS_KEY"]
|
||||
|
||||
# S3 path
|
||||
s3_path = f"s3://{bucket}/{dataset}/{table}/{table}.parquet"
|
||||
|
||||
try:
|
||||
con = duckdb.connect("basedosdados.duckdb", read_only=False)
|
||||
con.execute("INSTALL httpfs; LOAD httpfs;")
|
||||
con.execute(f"SET s3_endpoint='{endpoint}';")
|
||||
con.execute(f"SET s3_access_key_id='{access_key}';")
|
||||
con.execute(f"SET s3_secret_access_key='{secret_key}';")
|
||||
con.execute(f"SET s3_url_style='path';")
|
||||
con.execute(f"CREATE SCHEMA IF NOT EXISTS {dataset}")
|
||||
con.execute(f"""
|
||||
CREATE OR REPLACE VIEW {dataset}.{table} AS
|
||||
SELECT * FROM read_parquet('{s3_path}', hive_partitioning=true, union_by_name=true)
|
||||
""")
|
||||
con.close()
|
||||
log.info("[DUCKDB] View created: %s.%s", dataset, table)
|
||||
return True, None
|
||||
except Exception as e:
|
||||
log.error("[DUCKDB] Failed to create view %s.%s: %s", dataset, table, e)
|
||||
return False, str(e)
|
||||
|
||||
|
||||
def run_sync(targets, args, env, dry_run=False, resume=False):
|
||||
"""Run the sync for all target tables."""
|
||||
s3_client = get_s3_client(env)
|
||||
bq_client = get_bq_client()
|
||||
|
||||
# Load done/failed tracking
|
||||
done_set = set()
|
||||
if resume:
|
||||
if os.path.exists(DONE_FILE):
|
||||
with open(DONE_FILE) as f:
|
||||
done_set = {l.strip() for l in f if l.strip()}
|
||||
log.info("Resuming: %d tables already done", len(done_set))
|
||||
|
||||
failed_count = 0
|
||||
done_count = 0
|
||||
|
||||
# Filter out already-done tables
|
||||
targets = [(tid, info) for tid, info in targets if tid not in done_set]
|
||||
|
||||
if not targets:
|
||||
log.info("No tables to sync.")
|
||||
return 0, 0
|
||||
|
||||
log.info("Syncing %d tables...", len(targets))
|
||||
|
||||
for i, (table_id, info) in enumerate(targets, 1):
|
||||
log.info("--- [%d/%d] Syncing %s ---", i, len(targets), table_id)
|
||||
|
||||
# Sync BQ → S3
|
||||
ok, msg = sync_table(
|
||||
(bq_client, s3_client, env["HETZNER_S3_BUCKET"]),
|
||||
table_id,
|
||||
info,
|
||||
dry_run=dry_run,
|
||||
)
|
||||
log.info(msg)
|
||||
|
||||
if dry_run:
|
||||
continue
|
||||
|
||||
if not ok:
|
||||
with open(FAILED_FILE, "a") as f:
|
||||
f.write(f"{table_id}\t{msg}\n")
|
||||
failed_count += 1
|
||||
continue
|
||||
|
||||
if "empty" in msg.lower():
|
||||
continue
|
||||
|
||||
# Update DuckDB view
|
||||
ok, err = update_duckdb_view(env, table_id, info)
|
||||
if not ok:
|
||||
with open(FAILED_FILE, "a") as f:
|
||||
f.write(f"{table_id}\tDUCKDB: {err}\n")
|
||||
|
||||
# Mark done
|
||||
with open(DONE_FILE, "a") as f:
|
||||
f.write(f"{table_id}\n")
|
||||
done_count += 1
|
||||
|
||||
return done_count, failed_count
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Sync missing BQ tables to S3")
|
||||
parser.add_argument("--dry-run", action="store_true", help="List tables without syncing")
|
||||
parser.add_argument("--resume", action="store_true", help="Resume from last run")
|
||||
args = parser.parse_args()
|
||||
|
||||
env = load_env()
|
||||
dry_run = args.dry_run
|
||||
|
||||
if dry_run:
|
||||
log.info("=== DRY RUN MODE ===")
|
||||
|
||||
# Step 1: List BigQuery tables
|
||||
bq_client = get_bq_client()
|
||||
bq_tables = list_bq_tables(bq_client)
|
||||
|
||||
# Step 2: List S3 tables
|
||||
s3_client = get_s3_client(env)
|
||||
s3_tables = list_s3_tables(s3_client, env["HETZNER_S3_BUCKET"])
|
||||
|
||||
# Step 3: Parse missing tables from MD
|
||||
md_missing = parse_missing_tables_from_md(MISSING_TABLES_FILE)
|
||||
|
||||
# Step 4: Compute targets
|
||||
targets = compute_missing_tables(bq_tables, s3_tables, md_missing)
|
||||
|
||||
if not targets:
|
||||
log.info("No tables to sync.")
|
||||
return
|
||||
|
||||
log.info("")
|
||||
log.info("============================================")
|
||||
log.info(" Tables to sync: %d", len(targets))
|
||||
log.info("============================================")
|
||||
for i, (table_id, info) in enumerate(targets, 1):
|
||||
size_mb = estimate_size_mb(info.get("num_bytes"))
|
||||
md_note = md_missing.get(table_id, "")
|
||||
log.info(" [%d] %-50s %6s MB %s", i, table_id, size_mb, md_note)
|
||||
log.info("")
|
||||
|
||||
if dry_run:
|
||||
total_bytes = sum(info.get("num_bytes", 0) or 0 for _, info in targets)
|
||||
total_gb = total_bytes / 1_073_741_824
|
||||
log.info("Total estimated size: %.2f GB (BigQuery compressed bytes)", total_gb)
|
||||
log.info("Run without --dry-run to start syncing.")
|
||||
return
|
||||
|
||||
# Step 5: Run sync
|
||||
log.info("Starting sync...")
|
||||
done_count, failed_count = run_sync(targets, None, env, dry_run=False, resume=args.resume)
|
||||
|
||||
log.info("")
|
||||
log.info("============================================")
|
||||
log.info(" Sync complete!")
|
||||
log.info(" Done: %d tables", done_count)
|
||||
log.info(" Failed: %d tables", failed_count)
|
||||
log.info(" Log: %s", LOG_FILE)
|
||||
log.info("============================================")
|
||||
|
||||
if failed_count > 0:
|
||||
log.info("Failed tables: see %s", FAILED_FILE)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -143,11 +143,36 @@ Sources from https://github.com/jxnxts/mcp-brasil not in `basedosdados.duckdb`.
|
||||
| INPE | `inpe` | none | `https://terrabrasilis.dpi.inpe.br/queimadas/bdqueimadas-data-service` | JSON |
|
||||
| Tabua Mares | `tabua_mares` | none | `https://tabuademares.com/api/v2` | JSON |
|
||||
|
||||
## Basedosdados.org — Not in basedosdados.duckdb (232 tables)
|
||||
## Basedosdados.org — Not in basedosdados.duckdb
|
||||
|
||||
Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and thus in your duckdb). The following datasets have **zero or partial** tables in duckdb.
|
||||
Basedosdados.org has **765 tables** on BigQuery, **~533** on S3. The remaining gap:
|
||||
|
||||
### Full datasets — no tables in duckdb
|
||||
- **2 TABLEs** need `bq extract` → GCS → S3 (waiting on GCP billing restore)
|
||||
- **~230 are VIEWs** → need `bq query` to materialize, then `bq extract` (or streaming write to S3)
|
||||
- **3 tables MISSING** from BQ entirely (br_bcb_sicor microdados_* don't exist)
|
||||
|
||||
### Need export — 2 TABLEs blocked on GCP billing
|
||||
|
||||
| Dataset | Table | BQ Type | Notes |
|
||||
|---------|-------|---------|-------|
|
||||
| `br_bcb_taxa_cambio` | taxa_cambio | TABLE | ✅ `bq extract` works |
|
||||
| `br_bcb_taxa_selic` | taxa_selic | TABLE | ✅ `bq extract` works |
|
||||
|
||||
### Already on S3 (no action needed)
|
||||
|
||||
| Dataset | Tables |
|
||||
|---------|--------|
|
||||
| `br_bd_metadados` | bigquery_tables, prefect_flow_runs |
|
||||
| `br_fbsp_absp` | uf, violencia_escola |
|
||||
| `br_ibge_estadic` | dicionario |
|
||||
| `br_camara_dados_abertos` | all 33 tables (222 parquet files) |
|
||||
| `br_me_rais` | dicionario, microdados_estabelecimentos, microdados_vinculos |
|
||||
|
||||
### ~230 VIEWs — need bq query materialization pipeline
|
||||
|
||||
Cannot `bq extract` directly. Need to: (1) materialize via `bq query --destination_table`, or (2) stream via Python Arrow → S3 directly.
|
||||
|
||||
#### Full datasets (all VIEWs)
|
||||
|
||||
| Dataset | Tables missing | Notes |
|
||||
|---------|----------------|-------|
|
||||
@@ -157,7 +182,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_anvisa_medicamentos_industrializados` | microdados | |
|
||||
| `br_ba_feiradesantana_camara_leis` | microdados | |
|
||||
| `br_bd_diretorios_data_tempo` | tempo, data, ano, mes, dia, hora, bimestre, trimestre, semestre, minuto, segundo | Directory of time dimensions |
|
||||
| `br_bd_metadados` | external_links, information_requests, organizations, prefect_flows, resources, tables | BD metadata catalog |
|
||||
| `br_bd_metadados` | external_links, information_requests, organizations, resources, tables | |
|
||||
| `br_bd_vizinhanca` | municipio, uf | |
|
||||
| `br_caixa_sorteios` | megasena | |
|
||||
| `br_camara_dados_abertos` | sigla_partido | |
|
||||
@@ -179,7 +204,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_ieps_saude` | brasil, macrorregiao, municipio, regiao_saude, uf | |
|
||||
| `br_imprensa_nacional_dou` | secao_1, secao_2, secao_3 | Official gazette sections |
|
||||
| `br_ipea_acesso_oportunidades` | estatisticas_2019, indicadores_2019 | |
|
||||
| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_municipio_de_para_anual/decenal/quinquenal, transicao_uf_de_para_anual/decenal/quinquenal | |
|
||||
| `br_mapbiomas_estatisticas` | classe, cobertura_municipio_classe, cobertura_uf_classe, transicao_*(anual/decenal/quinquenal) | |
|
||||
| `br_mc_indicadores` | transferencias_municipio | |
|
||||
| `br_me_clima_organizacional` | microdados | |
|
||||
| `br_me_estoque_divida_publica` | microdados | |
|
||||
@@ -188,7 +213,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_me_siape` | servidores_executivo_federal | |
|
||||
| `br_me_siorg` | remuneracao | |
|
||||
| `br_mma_extincao` | fauna_ameacada, flora_ameacada | |
|
||||
| `br_mobilidados_indicadores` | 11 tables (comprometimento_renda_tarifa_transp_publico, proporcao_*, taxa_motorizacao, etc.) | |
|
||||
| `br_mobilidados_indicadores` | 11 tables | |
|
||||
| `br_ms_atencao_basica` | municipio | |
|
||||
| `br_ms_imunizacoes` | municipio | |
|
||||
| `br_ons_energia_armazenada` | subsistemas | |
|
||||
@@ -219,18 +244,16 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `world_ti_corruption_perception` | country | |
|
||||
| `world_wb_wwbi` | country_finance, country_indicators | |
|
||||
|
||||
### Partial datasets — some tables in duckdb, some missing
|
||||
#### Partial datasets — missing tables (all VIEWs, except where noted)
|
||||
|
||||
| Dataset | Missing tables | In duckdb |
|
||||
|---------|----------------|-----------|
|
||||
| `br_anatel_banda_larga_fixa` | backhaul, pble | densidade_*, microdados |
|
||||
| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, recurso_publico_* |
|
||||
| `br_bcb_taxa_cambio` | taxa_cambio | — (ACCESS_DENIED) |
|
||||
| `br_bcb_taxa_selic` | taxa_selic | — (ACCESS_DENIED) |
|
||||
| `br_bcb_sicor` | microdados_liberacao, microdados_operacao, microdados_saldo | dicionario, liberacao, operacao, saldo, + 5 more TABLEs |
|
||||
| `br_ibge_pib` | brasil_antigo, municipio_antigo, regiao_antigo, uf, uf_antigo | gini, municipio |
|
||||
| `br_ibge_pnad_covid` | microdados | dicionario |
|
||||
| `br_ibge_pnadc` | ano_brasil_grupo_idade, ano_brasil_raca_cor, ano_municipio_*, ano_regiao_*, ano_uf_* (cross-tabs) | dicionario, educacao, microdados, rendimentos_outras_fontes |
|
||||
| `br_ibge_pof` | all 17 tables (morador, domicilio, despesa_*, consumo_*, etc.) | none |
|
||||
| `br_ibge_pnadc` | 10 cross-tab tables (ano_*) | dicionario, educacao, microdados, rendimentos_outras_fontes |
|
||||
| `br_ibge_pof` | all 17 tables (morador_*, domicilio_*, despesa_*, consumo_*, etc.) | none |
|
||||
| `br_inep_ana` | aluno, escola, prova | dicionario |
|
||||
| `br_inep_censo_escolar` | docente, matricula | dicionario, escola, turma |
|
||||
| `br_inep_formacao_docente` | brasil, escola, municipio, regiao, uf | dicionario |
|
||||
@@ -238,8 +261,7 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_inep_indicadores_educacionais` | escola_nivel_socioeconomico, fluxo_educacao_superior | all others |
|
||||
| `br_inmet_bdmep` | estacao | microdados |
|
||||
| `br_me_caged` | microdados_antigos, microdados_antigos_ajustes | dicionario, microdados_movimentacao* |
|
||||
| `br_me_cno` | microdados, microdados_cnae, microdados_vinculo | dicionario, microdados |
|
||||
| `br_me_rais` | all tables | dicionario, microdados_estabelecimentos, microdados_vinculos |
|
||||
| `br_me_cno` | microdados_cnae, microdados_vinculo | dicionario, microdados |
|
||||
| `br_mec_prouni` | microdados | dicionario |
|
||||
| `br_ms_sim` | municipio, municipio_causa, municipio_causa_idade, municipio_causa_idade_sexo_raca | dicionario, microdados |
|
||||
| `br_ms_sinan` | microdados_violencia | dicionario, microdados_dengue, microdados_influenza_srag |
|
||||
@@ -247,3 +269,13 @@ Basedosdados.org has **765 tables** on BigQuery, but only **533** are on S3 (and
|
||||
| `br_seeg_emissoes` | brasil | dicionario, municipio, uf |
|
||||
| `br_tse_eleicoes` | local_secao | all others |
|
||||
| `world_oecd_pisa` | dictionary, school_summary, student_summary | student |
|
||||
|
||||
### Tables that don't exist in BigQuery (3)
|
||||
|
||||
These were listed in datasets_to_scrap but actually don't exist in `basedosdados`:
|
||||
|
||||
| Dataset | Table |
|
||||
|---------|-------|
|
||||
| `br_bcb_sicor` | microdados_liberacao |
|
||||
| `br_bcb_sicor` | microdados_operacao |
|
||||
| `br_bcb_sicor` | microdados_saldo |
|
||||
|
||||
270
tasks/missing_tables.txt
Normal file
270
tasks/missing_tables.txt
Normal file
@@ -0,0 +1,270 @@
|
||||
br_abrinq_oca.municipio_primeira_infancia
|
||||
br_ana_atlas_esgotos.municipio
|
||||
br_ana_reservatorios.sin
|
||||
br_anvisa_medicamentos_industrializados.microdados
|
||||
br_ba_feiradesantana_camara_leis.microdados
|
||||
br_bd_diretorios_data_tempo.ano
|
||||
br_bd_diretorios_data_tempo.bimestre
|
||||
br_bd_diretorios_data_tempo.data
|
||||
br_bd_diretorios_data_tempo.dia
|
||||
br_bd_diretorios_data_tempo.hora
|
||||
br_bd_diretorios_data_tempo.mes
|
||||
br_bd_diretorios_data_tempo.minuto
|
||||
br_bd_diretorios_data_tempo.segundo
|
||||
br_bd_diretorios_data_tempo.semestre
|
||||
br_bd_diretorios_data_tempo.tempo
|
||||
br_bd_diretorios_data_tempo.trimestre
|
||||
br_bd_metadados.bigquery_tables
|
||||
br_bd_metadados.external_links
|
||||
br_bd_metadados.information_requests
|
||||
br_bd_metadados.organizations
|
||||
br_bd_metadados.prefect_flow_runs
|
||||
br_bd_metadados.resources
|
||||
br_bd_metadados.tables
|
||||
br_bd_vizinhanca.municipio
|
||||
br_bd_vizinhanca.uf
|
||||
br_caixa_sorteios.megasena
|
||||
br_camara_dados_abertos.deputado
|
||||
br_camara_dados_abertos.deputado_ocupacao
|
||||
br_camara_dados_abertos.deputado_profissao
|
||||
br_camara_dados_abertos.despesa
|
||||
br_camara_dados_abertos.evento
|
||||
br_camara_dados_abertos.evento_orgao
|
||||
br_camara_dados_abertos.evento_presenca_deputado
|
||||
br_camara_dados_abertos.evento_requerimento
|
||||
br_camara_dados_abertos.frente
|
||||
br_camara_dados_abertos.frente_deputado
|
||||
br_camara_dados_abertos.funcionario
|
||||
br_camara_dados_abertos.legislatura
|
||||
br_camara_dados_abertos.legislatura_mesa
|
||||
br_camara_dados_abertos.licitacao
|
||||
br_camara_dados_abertos.licitacao_contrato
|
||||
br_camara_dados_abertos.licitacao_item
|
||||
br_camara_dados_abertos.licitacao_pedido
|
||||
br_camara_dados_abertos.licitacao_proposta
|
||||
br_camara_dados_abertos.orgao
|
||||
br_camara_dados_abertos.orgao_deputado
|
||||
br_camara_dados_abertos.proposicao_autor
|
||||
br_camara_dados_abertos.proposicao_microdados
|
||||
br_camara_dados_abertos.proposicao_tema
|
||||
br_camara_dados_abertos.sigla_partido
|
||||
br_camara_dados_abertos.votacao
|
||||
br_camara_dados_abertos.votacao_objeto
|
||||
br_camara_dados_abertos.votacao_orientacao_bancada
|
||||
br_camara_dados_abertos.votacao_parlamentar
|
||||
br_camara_dados_abertos.votacao_proposicao
|
||||
br_capes_bolsas.mobilidade_internacional
|
||||
br_cgu_ebt.municipio
|
||||
br_cgu_ebt.uf
|
||||
br_cgu_fef.microdados
|
||||
br_cgu_fef.municipios_sorteados
|
||||
br_cgu_fef.sorteio
|
||||
br_cgu_pessoal_executivo_federal.terceirizados
|
||||
br_clp_ranking_competitividade.nota_geral_municipio
|
||||
br_clp_ranking_competitividade.nota_geral_uf
|
||||
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
|
||||
br_fbsp_absp.municipio
|
||||
br_fbsp_absp.uf
|
||||
br_fbsp_absp.violencia_escola
|
||||
br_firjan_ifgf.ranking
|
||||
br_ggb_relatorio_lgbtqi.brasil
|
||||
br_ggb_relatorio_lgbtqi.causa_obito
|
||||
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
|
||||
br_ggb_relatorio_lgbtqi.local
|
||||
br_ggb_relatorio_lgbtqi.raca_cor
|
||||
br_ibge_amc.municipio_de_para
|
||||
br_ibge_cbo_2002.perfil_ocupacional
|
||||
br_ibge_cbo_2002.sinonimo
|
||||
br_ibge_estadic.comunicacao_informatica
|
||||
br_ibge_estadic.dicionario
|
||||
br_ibge_estadic.educacao
|
||||
br_ibge_estadic.governanca
|
||||
br_ibge_estadic.indicadores_perfil_gestor
|
||||
br_ibge_estadic.indicadores_quantidade_vinculo
|
||||
br_ibge_estadic.politica_mulher
|
||||
br_ibge_estadic.recursos_humanos
|
||||
br_ibge_ipp.mes_categoria_economica
|
||||
br_ibge_ipp.mes_grupo_industrial
|
||||
br_ibge_ipp.mes_industria_atividade
|
||||
br_ibge_ipp.mes_industria_extrativa
|
||||
br_ibge_ipp.mes_industria_geral
|
||||
br_ibge_ipp.mes_industria_transformacao
|
||||
br_ibge_munic.indicadores_perfil_gestor
|
||||
br_ibge_munic.indicadores_quantidade_vinculo
|
||||
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
|
||||
br_ieps_saude.brasil
|
||||
br_ieps_saude.macrorregiao
|
||||
br_ieps_saude.municipio
|
||||
br_ieps_saude.regiao_saude
|
||||
br_ieps_saude.uf
|
||||
br_imprensa_nacional_dou.secao_1
|
||||
br_imprensa_nacional_dou.secao_2
|
||||
br_imprensa_nacional_dou.secao_3
|
||||
br_ipea_acesso_oportunidades.estatisticas_2019
|
||||
br_ipea_acesso_oportunidades.indicadores_2019
|
||||
br_mapbiomas_estatisticas.classe
|
||||
br_mapbiomas_estatisticas.cobertura_municipio_classe
|
||||
br_mapbiomas_estatisticas.cobertura_uf_classe
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
|
||||
br_mc_indicadores.transferencias_municipio
|
||||
br_me_clima_organizacional.microdados
|
||||
br_me_estoque_divida_publica.microdados
|
||||
br_me_exportadoras_importadoras.dicionario
|
||||
br_me_exportadoras_importadoras.estabelecimentos
|
||||
br_me_pensionistas.microdados
|
||||
br_me_siape.servidores_executivo_federal
|
||||
br_me_siorg.remuneracao
|
||||
br_mma_extincao.fauna_ameacada
|
||||
br_mma_extincao.flora_ameacada
|
||||
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
|
||||
br_mobilidados_indicadores.divisao_modal
|
||||
br_mobilidados_indicadores.emissao_co2_material_particulado
|
||||
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
|
||||
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
|
||||
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
|
||||
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
|
||||
br_mobilidados_indicadores.taxa_motorizacao
|
||||
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
|
||||
br_mobilidados_indicadores.transporte_media_alta_capacidade
|
||||
br_ms_atencao_basica.municipio
|
||||
br_ms_imunizacoes.municipio
|
||||
br_ons_energia_armazenada.subsistemas
|
||||
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
|
||||
br_rj_rio_de_janeiro_ipp_ips.indicadores
|
||||
br_rj_tce_iegm.indicadores
|
||||
br_senado_cpipandemia.discursos
|
||||
br_sgp_informacao.despesas_cartao_corporativo
|
||||
br_sp_alesp.assessores_lideranca
|
||||
br_sp_alesp.assessores_parlamentares
|
||||
br_sp_alesp.deputado
|
||||
br_sp_alesp.despesas_gabinete
|
||||
br_sp_alesp.despesas_gabinete_atual
|
||||
br_sp_gov_orcamento.despesa
|
||||
br_sp_gov_orcamento.receita_arrecadada
|
||||
br_sp_gov_orcamento.receita_prevista
|
||||
br_sp_gov_ssp.ocorrencias_registradas
|
||||
br_sp_gov_ssp.produtividade_policial
|
||||
br_sp_saopaulo_dieese_icv.ano
|
||||
br_sp_seduc_fluxo_escolar.escola
|
||||
br_sp_seduc_fluxo_escolar.municipio
|
||||
br_sp_seduc_idesp.diretoria
|
||||
br_sp_seduc_idesp.escola
|
||||
br_sp_seduc_idesp.uf
|
||||
br_sp_seduc_inse.escola
|
||||
br_tpe_classificacao_saeb.categoria
|
||||
eu_fra_lgbt.consciencia_direitos
|
||||
eu_fra_lgbt.cotidiano
|
||||
eu_fra_lgbt.discriminacao
|
||||
eu_fra_lgbt.especifico_transgenero
|
||||
eu_fra_lgbt.violencia_abuso
|
||||
mundo_bm_learning_poverty.pais
|
||||
mundo_kaggle_olimpiadas.microdados
|
||||
mundo_onu_adh.brasil
|
||||
mundo_onu_adh.municipio
|
||||
mundo_onu_adh.uf
|
||||
mundo_transrespect_transphobia.causa_obito
|
||||
mundo_transrespect_transphobia.local
|
||||
mundo_transrespect_transphobia.pais
|
||||
nl_ug_pwt.microdados
|
||||
world_fao_production.country_group
|
||||
world_fao_production.crop_livestock
|
||||
world_fao_production.dictionary
|
||||
world_fao_production.element
|
||||
world_fao_production.item
|
||||
world_fao_production.item_group
|
||||
world_fao_production.production_indices
|
||||
world_fao_production.value_agricultural_production
|
||||
world_fifa_women_world_cup.matches
|
||||
world_fifa_worldcup.award_winners
|
||||
world_fifa_worldcup.matches
|
||||
world_fifa_worldcup.players
|
||||
world_fifa_worldcup.teams
|
||||
world_fifa_worldcup.tournaments
|
||||
world_gsps_consortium_gsps.global_indicators
|
||||
world_slave_voyages_consortium_slave_trade.transatlantic
|
||||
world_spi_spi.global_indicators
|
||||
world_ti_corruption_perception.country
|
||||
world_wb_wwbi.country_finance
|
||||
world_wb_wwbi.country_indicators
|
||||
br_anatel_banda_larga_fixa.backhaul
|
||||
br_anatel_banda_larga_fixa.pble
|
||||
br_bcb_sicor.microdados_liberacao
|
||||
br_bcb_sicor.microdados_operacao
|
||||
br_bcb_sicor.microdados_saldo
|
||||
br_bcb_taxa_cambio.taxa_cambio
|
||||
br_bcb_taxa_selic.taxa_selic
|
||||
br_ibge_pib.brasil_antigo
|
||||
br_ibge_pib.municipio_antigo
|
||||
br_ibge_pib.regiao_antigo
|
||||
br_ibge_pib.uf
|
||||
br_ibge_pib.uf_antigo
|
||||
br_ibge_pnad_covid.microdados
|
||||
br_ibge_pnadc.ano_brasil_grupo_idade
|
||||
br_ibge_pnadc.ano_brasil_raca_cor
|
||||
br_ibge_pnadc.ano_municipio_grupo_idade
|
||||
br_ibge_pnadc.ano_municipio_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_raca_cor
|
||||
br_ibge_pnadc.ano_uf_grupo_idade
|
||||
br_ibge_pnadc.ano_uf_raca_cor
|
||||
br_ibge_pof.aluguel_estimado_2017
|
||||
br_ibge_pof.cadastro_de_produtos_2017
|
||||
br_ibge_pof.caderneta_coletiva_2017
|
||||
br_ibge_pof.caracteristicas_dieta_2017
|
||||
br_ibge_pof.condicoes_vida_2017
|
||||
br_ibge_pof.consumo_alimentar_2017
|
||||
br_ibge_pof.despesa_coletiva_2017
|
||||
br_ibge_pof.despesa_individual_2017
|
||||
br_ibge_pof.domicilio_2017
|
||||
br_ibge_pof.inventario_2017
|
||||
br_ibge_pof.morador_2017
|
||||
br_ibge_pof.outros_rendimentos_2017
|
||||
br_ibge_pof.rendimento_trabalho_2017
|
||||
br_ibge_pof.restricao_saude_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof2_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof4_2017
|
||||
br_inep_ana.aluno
|
||||
br_inep_ana.escola
|
||||
br_inep_ana.prova
|
||||
br_inep_censo_escolar.docente
|
||||
br_inep_censo_escolar.matricula
|
||||
br_inep_formacao_docente.brasil
|
||||
br_inep_formacao_docente.escola
|
||||
br_inep_formacao_docente.municipio
|
||||
br_inep_formacao_docente.regiao
|
||||
br_inep_formacao_docente.uf
|
||||
br_inep_indicador_nivel_socioeconomico.brasil
|
||||
br_inep_indicador_nivel_socioeconomico.municipio
|
||||
br_inep_indicador_nivel_socioeconomico.uf
|
||||
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
|
||||
br_inep_indicadores_educacionais.fluxo_educacao_superior
|
||||
br_inmet_bdmep.estacao
|
||||
br_me_caged.microdados_antigos
|
||||
br_me_caged.microdados_antigos_ajustes
|
||||
br_me_cno.microdados_cnae
|
||||
br_me_cno.microdados_vinculo
|
||||
br_me_rais.dicionario
|
||||
br_me_rais.microdados_estabelecimentos
|
||||
br_me_rais.microdados_vinculos
|
||||
br_mec_prouni.microdados
|
||||
br_ms_sim.municipio
|
||||
br_ms_sim.municipio_causa
|
||||
br_ms_sim.municipio_causa_idade
|
||||
br_ms_sim.municipio_causa_idade_sexo_raca
|
||||
br_ms_sinan.microdados_violencia
|
||||
br_ms_vacinacao_covid19.microdados
|
||||
br_ms_vacinacao_covid19.microdados_estabelecimento
|
||||
br_ms_vacinacao_covid19.microdados_paciente
|
||||
br_ms_vacinacao_covid19.microdados_vacinacao
|
||||
br_seeg_emissoes.brasil
|
||||
br_tse_eleicoes.local_secao
|
||||
world_oecd_pisa.dictionary
|
||||
world_oecd_pisa.school_summary
|
||||
world_oecd_pisa.student_summary
|
||||
2
tasks/pending_tables.txt
Normal file
2
tasks/pending_tables.txt
Normal file
@@ -0,0 +1,2 @@
|
||||
br_bcb_taxa_cambio.taxa_cambio
|
||||
br_bcb_taxa_selic.taxa_selic
|
||||
229
tasks/views_to_materialize.txt
Normal file
229
tasks/views_to_materialize.txt
Normal file
@@ -0,0 +1,229 @@
|
||||
br_abrinq_oca.municipio_primeira_infancia
|
||||
br_ana_atlas_esgotos.municipio
|
||||
br_ana_reservatorios.sin
|
||||
br_anvisa_medicamentos_industrializados.microdados
|
||||
br_ba_feiradesantana_camara_leis.microdados
|
||||
br_bd_diretorios_data_tempo.ano
|
||||
br_bd_diretorios_data_tempo.bimestre
|
||||
br_bd_diretorios_data_tempo.data
|
||||
br_bd_diretorios_data_tempo.dia
|
||||
br_bd_diretorios_data_tempo.hora
|
||||
br_bd_diretorios_data_tempo.mes
|
||||
br_bd_diretorios_data_tempo.minuto
|
||||
br_bd_diretorios_data_tempo.segundo
|
||||
br_bd_diretorios_data_tempo.semestre
|
||||
br_bd_diretorios_data_tempo.tempo
|
||||
br_bd_diretorios_data_tempo.trimestre
|
||||
br_bd_metadados.external_links
|
||||
br_bd_metadados.information_requests
|
||||
br_bd_metadados.organizations
|
||||
br_bd_metadados.resources
|
||||
br_bd_metadados.tables
|
||||
br_bd_vizinhanca.municipio
|
||||
br_bd_vizinhanca.uf
|
||||
br_caixa_sorteios.megasena
|
||||
br_camara_dados_abertos.sigla_partido
|
||||
br_capes_bolsas.mobilidade_internacional
|
||||
br_cgu_ebt.municipio
|
||||
br_cgu_ebt.uf
|
||||
br_cgu_fef.microdados
|
||||
br_cgu_fef.municipios_sorteados
|
||||
br_cgu_fef.sorteio
|
||||
br_cgu_pessoal_executivo_federal.terceirizados
|
||||
br_clp_ranking_competitividade.nota_geral_municipio
|
||||
br_clp_ranking_competitividade.nota_geral_uf
|
||||
br_cnj_estatisticas_poder_judiciario.recursos_financeiros
|
||||
br_fbsp_absp.municipio
|
||||
br_firjan_ifgf.ranking
|
||||
br_ggb_relatorio_lgbtqi.brasil
|
||||
br_ggb_relatorio_lgbtqi.causa_obito
|
||||
br_ggb_relatorio_lgbtqi.grupo_lgbtqia
|
||||
br_ggb_relatorio_lgbtqi.local
|
||||
br_ggb_relatorio_lgbtqi.raca_cor
|
||||
br_ibge_amc.municipio_de_para
|
||||
br_ibge_cbo_2002.perfil_ocupacional
|
||||
br_ibge_cbo_2002.sinonimo
|
||||
br_ibge_estadic.comunicacao_informatica
|
||||
br_ibge_estadic.educacao
|
||||
br_ibge_estadic.governanca
|
||||
br_ibge_estadic.indicadores_perfil_gestor
|
||||
br_ibge_estadic.indicadores_quantidade_vinculo
|
||||
br_ibge_estadic.politica_mulher
|
||||
br_ibge_estadic.recursos_humanos
|
||||
br_ibge_ipp.mes_categoria_economica
|
||||
br_ibge_ipp.mes_grupo_industrial
|
||||
br_ibge_ipp.mes_industria_atividade
|
||||
br_ibge_ipp.mes_industria_extrativa
|
||||
br_ibge_ipp.mes_industria_geral
|
||||
br_ibge_ipp.mes_industria_transformacao
|
||||
br_ibge_munic.indicadores_perfil_gestor
|
||||
br_ibge_munic.indicadores_quantidade_vinculo
|
||||
br_ibge_nomes_brasil.quantidade_municipio_nome_2010
|
||||
br_ibge_pib.brasil_antigo
|
||||
br_ibge_pib.municipio_antigo
|
||||
br_ibge_pib.regiao_antigo
|
||||
br_ibge_pib.uf
|
||||
br_ibge_pib.uf_antigo
|
||||
br_ibge_pnad_covid.microdados
|
||||
br_ibge_pnadc.ano_brasil_grupo_idade
|
||||
br_ibge_pnadc.ano_brasil_raca_cor
|
||||
br_ibge_pnadc.ano_municipio_grupo_idade
|
||||
br_ibge_pnadc.ano_municipio_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_grupo_idade
|
||||
br_ibge_pnadc.ano_regiao_metropolitana_raca_cor
|
||||
br_ibge_pnadc.ano_regiao_raca_cor
|
||||
br_ibge_pnadc.ano_uf_grupo_idade
|
||||
br_ibge_pnadc.ano_uf_raca_cor
|
||||
br_ibge_pof.aluguel_estimado_2017
|
||||
br_ibge_pof.cadastro_de_produtos_2017
|
||||
br_ibge_pof.caderneta_coletiva_2017
|
||||
br_ibge_pof.caracteristicas_dieta_2017
|
||||
br_ibge_pof.condicoes_vida_2017
|
||||
br_ibge_pof.consumo_alimentar_2017
|
||||
br_ibge_pof.despesa_coletiva_2017
|
||||
br_ibge_pof.despesa_individual_2017
|
||||
br_ibge_pof.domicilio_2017
|
||||
br_ibge_pof.inventario_2017
|
||||
br_ibge_pof.morador_2017
|
||||
br_ibge_pof.outros_rendimentos_2017
|
||||
br_ibge_pof.rendimento_trabalho_2017
|
||||
br_ibge_pof.restricao_saude_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof2_2017
|
||||
br_ibge_pof.servico_nao_monetario_pof4_2017
|
||||
br_ieps_saude.brasil
|
||||
br_ieps_saude.macrorregiao
|
||||
br_ieps_saude.municipio
|
||||
br_ieps_saude.regiao_saude
|
||||
br_ieps_saude.uf
|
||||
br_imprensa_nacional_dou.secao_1
|
||||
br_imprensa_nacional_dou.secao_2
|
||||
br_imprensa_nacional_dou.secao_3
|
||||
br_ipea_acesso_oportunidades.estatisticas_2019
|
||||
br_ipea_acesso_oportunidades.indicadores_2019
|
||||
br_mapbiomas_estatisticas.classe
|
||||
br_mapbiomas_estatisticas.cobertura_municipio_classe
|
||||
br_mapbiomas_estatisticas.cobertura_uf_classe
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_municipio_de_para_quinquenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_anual
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_decenal
|
||||
br_mapbiomas_estatisticas.transicao_uf_de_para_quinquenal
|
||||
br_mc_indicadores.transferencias_municipio
|
||||
br_me_caged.microdados_antigos
|
||||
br_me_caged.microdados_antigos_ajustes
|
||||
br_me_clima_organizacional.microdados
|
||||
br_me_cno.microdados_cnae
|
||||
br_me_cno.microdados_vinculo
|
||||
br_me_estoque_divida_publica.microdados
|
||||
br_me_exportadoras_importadoras.dicionario
|
||||
br_me_exportadoras_importadoras.estabelecimentos
|
||||
br_me_pensionistas.microdados
|
||||
br_me_siape.servidores_executivo_federal
|
||||
br_me_siorg.remuneracao
|
||||
br_mec_prouni.microdados
|
||||
br_mma_extincao.fauna_ameacada
|
||||
br_mma_extincao.flora_ameacada
|
||||
br_mobilidados_indicadores.comprometimento_renda_tarifa_transp_publico
|
||||
br_mobilidados_indicadores.divisao_modal
|
||||
br_mobilidados_indicadores.emissao_co2_material_particulado
|
||||
br_mobilidados_indicadores.proporcao_domicilios_infra_urbana
|
||||
br_mobilidados_indicadores.proporcao_mortes_negras_acidente_transporte
|
||||
br_mobilidados_indicadores.proporcao_pessoas_prox_infra_cicloviaria
|
||||
br_mobilidados_indicadores.proporcao_pessoas_proximas_pnt
|
||||
br_mobilidados_indicadores.taxa_motorizacao
|
||||
br_mobilidados_indicadores.tempo_deslocamento_casa_trabalho
|
||||
br_mobilidados_indicadores.transporte_media_alta_capacidade
|
||||
br_ms_atencao_basica.municipio
|
||||
br_ms_imunizacoes.municipio
|
||||
br_ms_sim.municipio
|
||||
br_ms_sim.municipio_causa
|
||||
br_ms_sim.municipio_causa_idade
|
||||
br_ms_sim.municipio_causa_idade_sexo_raca
|
||||
br_ms_sinan.microdados_violencia
|
||||
br_ms_vacinacao_covid19.microdados
|
||||
br_ms_vacinacao_covid19.microdados_estabelecimento
|
||||
br_ms_vacinacao_covid19.microdados_paciente
|
||||
br_ms_vacinacao_covid19.microdados_vacinacao
|
||||
br_ons_energia_armazenada.subsistemas
|
||||
br_rj_rio_de_janeiro_ipp_ips.dimensoes_componentes
|
||||
br_rj_rio_de_janeiro_ipp_ips.indicadores
|
||||
br_rj_tce_iegm.indicadores
|
||||
br_seeg_emissoes.brasil
|
||||
br_senado_cpipandemia.discursos
|
||||
br_sgp_informacao.despesas_cartao_corporativo
|
||||
br_sp_alesp.assessores_lideranca
|
||||
br_sp_alesp.assessores_parlamentares
|
||||
br_sp_alesp.deputado
|
||||
br_sp_alesp.despesas_gabinete
|
||||
br_sp_alesp.despesas_gabinete_atual
|
||||
br_sp_gov_orcamento.despesa
|
||||
br_sp_gov_orcamento.receita_arrecadada
|
||||
br_sp_gov_orcamento.receita_prevista
|
||||
br_sp_gov_ssp.ocorrencias_registradas
|
||||
br_sp_gov_ssp.produtividade_policial
|
||||
br_sp_saopaulo_dieese_icv.ano
|
||||
br_sp_seduc_fluxo_escolar.escola
|
||||
br_sp_seduc_fluxo_escolar.municipio
|
||||
br_sp_seduc_idesp.diretoria
|
||||
br_sp_seduc_idesp.escola
|
||||
br_sp_seduc_idesp.uf
|
||||
br_sp_seduc_inse.escola
|
||||
br_tpe_classificacao_saeb.categoria
|
||||
br_tse_eleicoes.local_secao
|
||||
eu_fra_lgbt.consciencia_direitos
|
||||
eu_fra_lgbt.cotidiano
|
||||
eu_fra_lgbt.discriminacao
|
||||
eu_fra_lgbt.especifico_transgenero
|
||||
eu_fra_lgbt.violencia_abuso
|
||||
mundo_bm_learning_poverty.pais
|
||||
mundo_kaggle_olimpiadas.microdados
|
||||
mundo_onu_adh.brasil
|
||||
mundo_onu_adh.municipio
|
||||
mundo_onu_adh.uf
|
||||
mundo_transrespect_transphobia.causa_obito
|
||||
mundo_transrespect_transphobia.local
|
||||
mundo_transrespect_transphobia.pais
|
||||
nl_ug_pwt.microdados
|
||||
world_fao_production.country_group
|
||||
world_fao_production.crop_livestock
|
||||
world_fao_production.dictionary
|
||||
world_fao_production.element
|
||||
world_fao_production.item
|
||||
world_fao_production.item_group
|
||||
world_fao_production.production_indices
|
||||
world_fao_production.value_agricultural_production
|
||||
world_fifa_women_world_cup.matches
|
||||
world_fifa_worldcup.award_winners
|
||||
world_fifa_worldcup.matches
|
||||
world_fifa_worldcup.players
|
||||
world_fifa_worldcup.teams
|
||||
world_fifa_worldcup.tournaments
|
||||
world_gsps_consortium_gsps.global_indicators
|
||||
world_oecd_pisa.dictionary
|
||||
world_oecd_pisa.school_summary
|
||||
world_oecd_pisa.student_summary
|
||||
world_slave_voyages_consortium_slave_trade.transatlantic
|
||||
world_spi_spi.global_indicators
|
||||
world_ti_corruption_perception.country
|
||||
world_wb_wwbi.country_finance
|
||||
world_wb_wwbi.country_indicators
|
||||
br_anatel_banda_larga_fixa.backhaul
|
||||
br_anatel_banda_larga_fixa.pble
|
||||
br_inep_ana.aluno
|
||||
br_inep_ana.escola
|
||||
br_inep_ana.prova
|
||||
br_inep_censo_escolar.docente
|
||||
br_inep_censo_escolar.matricula
|
||||
br_inep_formacao_docente.brasil
|
||||
br_inep_formacao_docente.escola
|
||||
br_inep_formacao_docente.municipio
|
||||
br_inep_formacao_docente.regiao
|
||||
br_inep_formacao_docente.uf
|
||||
br_inep_indicador_nivel_socioeconomico.brasil
|
||||
br_inep_indicador_nivel_socioeconomico.municipio
|
||||
br_inep_indicador_nivel_socioeconomico.uf
|
||||
br_inep_indicadores_educacionais.escola_nivel_socioeconomico
|
||||
br_inep_indicadores_educacionais.fluxo_educacao_superior
|
||||
br_inmet_bdmep.estacao
|
||||
Reference in New Issue
Block a user