- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
Goal
Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.
Instructions
- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
- Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs
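The threshold-based selection described in the instructions above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code in table_selector.rs: the function names, the toy 3-dimensional vectors, and the table names table_a, table_b and table_c are all made up for the example (the real embeddings are 384-dimensional).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def select_tables(question_vec, table_vecs, threshold=0.35):
    """Return (table, score) pairs scoring at or above threshold, best first.

    Unlike a fixed top-k, the number of results depends on the question:
    a narrow question may match one table, a broad one many.
    """
    scored = [(name, cosine_similarity(question_vec, vec))
              for name, vec in table_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): three tables and one question vector.
toy_tables = {
    "table_a": [0.9, 0.1, 0.0],
    "table_b": [0.7, 0.7, 0.0],
    "table_c": [0.0, 0.1, 0.9],
}
question = [1.0, 0.2, 0.0]

selected = select_tables(question, toy_tables)
strict = select_tables(question, toy_tables, threshold=0.9)
```

With the default threshold, table_a and table_b are kept and table_c is dropped; raising the threshold to 0.9 keeps only table_a.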
Discoveries
- Schema format: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- Embeddings work: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
- Threshold tuning: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
- sqlcoder issues: Returns JSON instead of SQL when using format: "json"; removing it helps but still generates imperfect SQL
- Retry mechanism: Already built into main.rs - helps fix SQL errors automatically
- Top donation query works: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)
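The retry mechanism mentioned above is implemented in main.rs, which is not shown here. One common shape for such a loop, given as a sketch under assumptions, is to feed the database error back to the generator on the next attempt. In the example below, generate_with_retry, fake_generate and run_sql are hypothetical names, and an in-memory sqlite3 database stands in for DuckDB so that the example needs only the standard library.

```python
import sqlite3


def generate_with_retry(question, generate, execute, max_attempts=3):
    """Ask `generate` for SQL, run it with `execute`, retry on failure.

    generate(question, previous_sql, error) -> SQL string
    execute(sql) -> rows, raising an exception when the SQL is invalid
    Returns (sql, rows, attempts_used).
    """
    previous_sql, error = None, None
    for attempt in range(1, max_attempts + 1):
        sql = generate(question, previous_sql, error)
        try:
            return sql, execute(sql), attempt
        except Exception as exc:
            # Keep the failing statement and message for the next prompt.
            previous_sql, error = sql, str(exc)
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {error}")


# Demo: sqlite3 is only a stand-in database for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1), (2)")

# A fake generator that fails once, then returns valid SQL.
canned = iter(["SELEC x FROM t", "SELECT x FROM t ORDER BY x"])
seen_errors = []


def fake_generate(question, previous_sql, error):
    seen_errors.append(error)  # what a real prompt would include
    return next(canned)


def run_sql(sql):
    return conn.execute(sql).fetchall()


sql, rows, attempts = generate_with_retry("list x", fake_generate, run_sql)
```

The second call to the generator receives the syntax error from the first attempt, which is the information a model would need to correct its output.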
Accomplished
- ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
- ✅ Created table_embeddings.json (~2MB, 765 tables)
- ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
- ✅ Created schema_filter.rs - extracts filtered schema from full JSON
- ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
- ✅ Modified main.rs - integrated table selection + configurable SQL generator
- ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
- ✅ Updated README.md with new architecture and env vars
- ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/
Relevant files / directories
Created/Modified
- embed_tables.py - Python script to generate table embeddings
- context/table_embeddings.json - Pre-computed embeddings (765 tables)
- ask/src/table_selector.rs - Table selection via embeddings
- ask/src/schema_filter.rs - Schema filtering module
- ask/src/sql_generator.rs - SQL generator trait + implementations
- ask/src/main.rs - Integrated all components
- ask/Cargo.toml - Added serde dependency
- README.md - Updated with new architecture
- docs/wordcloud_attributes.png - Word cloud from column names/descriptions
- docs/wordcloud_datasets.png - Word cloud from dataset names
Configuration (env vars)
- SQL_GENERATOR - sqlcoder|gemini|openrouter
- SIMILARITY_THRESHOLD - 0.35 default
- OLLAMA_MODEL - sqlcoder:7b-q4_K_M
- EMBEDDINGS_FILE, SCHEMA_JSON
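As an illustration of how these variables might be read and validated, here is a small Python sketch. The real binary is written in Rust, so this only shows the intended behaviour. Several details are assumptions: the Settings class and load_settings function are hypothetical, sqlcoder is assumed to be the default generator, and the default paths for EMBEDDINGS_FILE and SCHEMA_JSON are guesses based on the file names listed above.

```python
import os
from dataclasses import dataclass

VALID_GENERATORS = ("sqlcoder", "gemini", "openrouter")


@dataclass(frozen=True)
class Settings:
    sql_generator: str
    similarity_threshold: float
    ollama_model: str
    embeddings_file: str
    schema_json: str


def load_settings(env=None):
    """Build Settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env

    # Assumption: sqlcoder is the default when SQL_GENERATOR is unset.
    generator = env.get("SQL_GENERATOR", "sqlcoder")
    if generator not in VALID_GENERATORS:
        raise ValueError(
            f"SQL_GENERATOR must be one of {VALID_GENERATORS}, got {generator!r}"
        )

    threshold = float(env.get("SIMILARITY_THRESHOLD", "0.35"))
    if not -1.0 <= threshold <= 1.0:  # cosine similarity lies in [-1, 1]
        raise ValueError(f"SIMILARITY_THRESHOLD out of range: {threshold}")

    return Settings(
        sql_generator=generator,
        similarity_threshold=threshold,
        ollama_model=env.get("OLLAMA_MODEL", "sqlcoder:7b-q4_K_M"),
        # Assumed default paths; adjust to the actual repository layout.
        embeddings_file=env.get("EMBEDDINGS_FILE", "context/table_embeddings.json"),
        schema_json=env.get("SCHEMA_JSON", "basedosdados-schema.json"),
    )


defaults = load_settings({})
custom = load_settings({"SQL_GENERATOR": "gemini", "SIMILARITY_THRESHOLD": "0.45"})
```

Passing a plain dict instead of os.environ keeps the function easy to test; rejecting unknown generator names early avoids a confusing failure later in the pipeline.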
Next Steps
- Increase similarity threshold (try 0.45) to reduce table count
- Improve sqlcoder prompt for better SQL generation
- Add fallback to increase threshold if too many tables selected
- Consider keyword matching as backup if embeddings fail
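Two of the next steps, raising the threshold when too many tables are selected and falling back to keyword matching, could look roughly like the following. This is a sketch of the idea rather than a planned implementation: select_with_cap, keyword_fallback, and the max_tables, step, ceiling and limit parameters are all hypothetical, as are the toy scores and table descriptions.

```python
def select_with_cap(scores, threshold=0.35, max_tables=20, step=0.05, ceiling=0.95):
    """Raise the threshold in fixed steps until at most max_tables remain.

    scores maps table name -> similarity to the question.
    Returns (threshold_used, table names sorted best first).
    """
    current = threshold
    selected = [name for name, score in scores.items() if score >= current]
    while len(selected) > max_tables and current + step <= ceiling:
        current = round(current + step, 10)  # avoid float drift across steps
        selected = [name for name, score in scores.items() if score >= current]
    return current, sorted(selected, key=scores.get, reverse=True)


def keyword_fallback(question, table_text, limit=5):
    """Rank tables by how many question words appear in their description."""
    words = {w for w in question.lower().split() if len(w) > 2}
    hits = {name: len(words & set(text.lower().split()))
            for name, text in table_text.items()}
    ranked = sorted((n for n, c in hits.items() if c > 0),
                    key=hits.get, reverse=True)
    return ranked[:limit]


# Toy data (hypothetical scores and descriptions).
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.42, "d": 0.36}
used_threshold, capped = select_with_cap(toy_scores, max_tables=2)

toy_text = {"t1": "doacoes para deputados", "t2": "estacoes de clima"}
fallback = keyword_fallback("deputados com mais doacoes", toy_text)
```

In the toy run the threshold climbs from 0.35 to 0.45 before only two tables remain, and the keyword fallback returns t1 because it shares two words with the question while t2 shares none.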