Files
baseldosdados/docs/dataset_embeds.md
rafapolo ed5fa6756e refactor: reorganize project structure and fix broken references
- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
2026-03-29 20:46:27 +02:00

3.2 KiB

Goal

Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.

Instructions

  • Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
  • Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
  • Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
  • Include column descriptions from basedosdados-schema.json in table embeddings
  • Generate word clouds from schema attributes and dataset names for docs

Discoveries

  • Schema format: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
  • Embeddings work: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
  • Threshold tuning: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
  • sqlcoder issues: Returns JSON instead of SQL when using format: "json" - removing it helps but still generates imperfect SQL
  • Retry mechanism: Already built into main.rs - helps fix SQL errors automatically
  • Top donation query works: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

Accomplished

  1. Created embed_tables.py - generates embeddings from basedosdados-schema.json
  2. Created table_embeddings.json (~2MB, 765 tables)
  3. Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
  4. Created schema_filter.rs - extracts filtered schema from full JSON
  5. Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
  6. Modified main.rs - integrated table selection + configurable SQL generator
  7. Fixed existing Rust compilation errors in main.rs (ratatui API changes)
  8. Updated README.md with new architecture and env vars
  9. Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

Relevant files / directories

Created/Modified

  • embed_tables.py - Python script to generate table embeddings
  • context/table_embeddings.json - Pre-computed embeddings (765 tables)
  • ask/src/table_selector.rs - Table selection via embeddings
  • ask/src/schema_filter.rs - Schema filtering module
  • ask/src/sql_generator.rs - SQL generator trait + implementations
  • ask/src/main.rs - Integrated all components
  • ask/Cargo.toml - Added serde dependency
  • README.md - Updated with new architecture
  • docs/wordcloud_attributes.png - Word cloud from column names/descriptions
  • docs/wordcloud_datasets.png - Word cloud from dataset names

Configuration (env vars)

  • SQL_GENERATOR - sqlcoder|gemini|openrouter
  • SIMILARITY_THRESHOLD - 0.35 default
  • OLLAMA_MODEL - sqlcoder:7b-q4_K_M
  • EMBEDDINGS_FILE, SCHEMA_JSON

Next Steps

  • Increase similarity threshold (try 0.45) to reduce table count
  • Improve sqlcoder prompt for better SQL generation
  • Add fallback to increase threshold if too many tables selected
  • Consider keyword matching as backup if embeddings fail