## Goal Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs. ## Instructions - Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity - Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables - Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars - Include column descriptions from basedosdados-schema.json in table embeddings - Generate word clouds from schema attributes and dataset names for docs ## Discoveries - **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB) - **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables - **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones - **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL - **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically - **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.) ## Accomplished 1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json 2. ✅ Created table_embeddings.json (~2MB, 765 tables) 3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold 4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON 5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter 6. ✅ Modified main.rs - integrated table selection + configurable SQL generator 7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes) 8. ✅ Updated README.md with new architecture and env vars 9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/ ## Relevant files / directories ### Created/Modified - `embed_tables.py` - Python script to generate table embeddings - `context/table_embeddings.json` - Pre-computed embeddings (765 tables) - `ask/src/table_selector.rs` - Table selection via embeddings - `ask/src/schema_filter.rs` - Schema filtering module - `ask/src/sql_generator.rs` - SQL generator trait + implementations - `ask/src/main.rs` - Integrated all components - `ask/Cargo.toml` - Added serde dependency - `README.md` - Updated with new architecture - `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions - `docs/wordcloud_datasets.png` - Word cloud from dataset names ### Configuration (env vars) - `SQL_GENERATOR` - sqlcoder|gemini|openrouter - `SIMILARITY_THRESHOLD` - 0.35 default - `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M - `EMBEDDINGS_FILE`, `SCHEMA_JSON` ## Next Steps - Increase similarity threshold (try 0.45) to reduce table count - Improve sqlcoder prompt for better SQL generation - Add fallback to increase threshold if too many tables selected - Consider keyword matching as backup if embeddings fail