- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories

## Goal

Build an intelligent SQL generator for Base dos Dados. It uses semantic search (sentence-transformers) to select the relevant tables from the schema before generating SQL, and it can run against a local model (sqlcoder via Ollama) or an external API (gemini or openrouter).

## Instructions

- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables by their similarity to the user's question
- Use a similarity threshold (default 0.35) instead of a fixed top-k, so the number of selected tables adapts to the question
- Implement a configurable SQL generator (sqlcoder/gemini/openrouter), selected via env vars
- Include column descriptions from basedosdados-schema.json in the table embeddings
- Generate word clouds from schema attributes and dataset names for the docs

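
The threshold-based selection described above can be sketched as follows. This is a minimal illustration, not the code in `table_selector.rs`: the names `TableEmbedding`, `cosine_similarity`, and `select_tables` are hypothetical, and the question vector is assumed to have been produced already by the same embedding model as the table vectors.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical record: a table name plus its precomputed embedding.
struct TableEmbedding {
    name: String,
    vector: Vec<f32>,
}

/// Keep every table whose similarity to the question meets the threshold,
/// ordered from most to least similar. Unlike top-k, the result size varies.
fn select_tables(
    question: &[f32],
    tables: &[TableEmbedding],
    threshold: f32,
) -> Vec<(String, f32)> {
    let mut hits: Vec<(String, f32)> = tables
        .iter()
        .map(|t| (t.name.clone(), cosine_similarity(question, &t.vector)))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}
```

With a threshold, a narrow question may match only a handful of tables while a broad one matches many, which is the behaviour a fixed top-k cannot give.
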
## Discoveries

- **Schema format**: basedosdados-schema.json (~3.8MB) describes 765 tables with column names, types, and descriptions
- **Embeddings work**: all-MiniLM-L6-v2 (384-dim) matches questions to tables
- **Threshold tuning**: The default threshold of 0.35 works best. Lower values return too many tables (190+); higher values may miss relevant ones
- **sqlcoder issues**: With `format: "json"` in the request, sqlcoder returns JSON instead of SQL. Removing the option helps, but the generated SQL is still imperfect
- **Retry mechanism**: Already built into main.rs; it helps fix SQL errors automatically
- **Top donation query works**: The question "deputados com mais doacoes" ("deputies with the most donations") returned the top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

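
As an illustration of the retry idea noted above, the loop below feeds the previous execution error back into the next generation attempt. It is a sketch under assumptions, not the logic in `main.rs`: `generate_with_retry` and the two closures standing in for the model call and the database call are hypothetical.

```rust
/// Ask `generate` for SQL and run it with `execute`. On failure, pass the
/// error text back to `generate` and try again, up to `max_attempts` times.
/// Returns the first SQL that executed, or the last error seen.
fn generate_with_retry<G, E>(
    question: &str,
    max_attempts: usize,
    mut generate: G,
    mut execute: E,
) -> Result<String, String>
where
    G: FnMut(&str, Option<&str>) -> String,
    E: FnMut(&str) -> Result<(), String>,
{
    let mut last_error: Option<String> = None;
    for _ in 0..max_attempts {
        // The second argument is None on the first attempt and carries the
        // previous error message on every retry.
        let sql = generate(question, last_error.as_deref());
        match execute(&sql) {
            Ok(()) => return Ok(sql),
            Err(e) => last_error = Some(e),
        }
    }
    Err(last_error.unwrap_or_else(|| "no attempts were made".to_string()))
}
```

Passing the error text back lets the prompt include what went wrong, which is typically more useful to the model than simply asking again.
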
## Accomplished

1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

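
One way to structure a generator trait with swappable backends, as item 5 describes, is sketched below. The trait shape, the `build_prompt` method, the prompt layout, and `generator_from_name` are assumptions made for illustration; the actual trait in `sql_generator.rs` may look different, and the network call to each backend is left out.

```rust
/// Hypothetical trait: each backend identifies itself and builds a prompt.
trait SqlGenerator {
    /// Backend identifier, matching a value of SQL_GENERATOR.
    fn name(&self) -> &'static str;

    /// Illustrative default prompt layout; a backend may override it.
    fn build_prompt(&self, question: &str, schema: &str) -> String {
        format!("Schema:\n{schema}\n\nQuestion: {question}\nSQL:")
    }
}

struct SqlCoder;
struct Gemini;
struct OpenRouter;

impl SqlGenerator for SqlCoder {
    fn name(&self) -> &'static str {
        "sqlcoder"
    }
}

impl SqlGenerator for Gemini {
    fn name(&self) -> &'static str {
        "gemini"
    }
}

impl SqlGenerator for OpenRouter {
    fn name(&self) -> &'static str {
        "openrouter"
    }
}

/// Pick a backend by name (case-insensitive); unknown names are an error.
fn generator_from_name(name: &str) -> Result<Box<dyn SqlGenerator>, String> {
    match name.trim().to_lowercase().as_str() {
        "sqlcoder" => Ok(Box::new(SqlCoder)),
        "gemini" => Ok(Box::new(Gemini)),
        "openrouter" => Ok(Box::new(OpenRouter)),
        other => Err(format!("unknown SQL generator: {other}")),
    }
}
```

Returning a boxed trait object keeps the calling code independent of which backend was chosen at runtime.
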
## Relevant files / directories

### Created/Modified

- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names

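### Configuration (env vars)

- `SQL_GENERATOR` - backend to use: `sqlcoder`, `gemini`, or `openrouter`
- `SIMILARITY_THRESHOLD` - minimum similarity for selecting a table (default 0.35)
- `OLLAMA_MODEL` - Ollama model tag (`sqlcoder:7b-q4_K_M`)
- `EMBEDDINGS_FILE`, `SCHEMA_JSON` - paths to the table embeddings file and the schema JSON
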
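
Reading these variables with defaults might look like the sketch below. The `Config` struct and both functions are hypothetical. The 0.35 default comes from the notes above, while treating `sqlcoder` and `sqlcoder:7b-q4_K_M` as defaults is an assumption.

```rust
use std::env;

/// Hypothetical settings struct covering three of the variables above.
#[derive(Debug, PartialEq)]
struct Config {
    sql_generator: String,
    similarity_threshold: f32,
    ollama_model: String,
}

/// Build the config from any key lookup, so tests need not touch the
/// real process environment.
fn config_from(lookup: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let similarity_threshold = match lookup("SIMILARITY_THRESHOLD") {
        Some(raw) => raw
            .trim()
            .parse::<f32>()
            .map_err(|e| format!("invalid SIMILARITY_THRESHOLD {raw:?}: {e}"))?,
        None => 0.35, // default stated in the notes
    };
    Ok(Config {
        // Assumed default backend; the notes do not say which one is the default.
        sql_generator: lookup("SQL_GENERATOR").unwrap_or_else(|| "sqlcoder".to_string()),
        similarity_threshold,
        // Assumed default model tag, taken from the value listed above.
        ollama_model: lookup("OLLAMA_MODEL")
            .unwrap_or_else(|| "sqlcoder:7b-q4_K_M".to_string()),
    })
}

/// Read the same settings from the process environment.
fn config_from_env() -> Result<Config, String> {
    config_from(|key| env::var(key).ok())
}
```

Rejecting an unparsable threshold, rather than silently falling back to 0.35, makes a typo in the environment visible at startup.
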
## Next Steps

- Increase the similarity threshold (try 0.45) to reduce the number of selected tables
- Improve the sqlcoder prompt for better SQL generation
- Add a fallback that raises the threshold when too many tables are selected
- Consider keyword matching as a backup if embeddings fail

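
Two of the ideas above, raising the threshold when too many tables pass and falling back to keyword matching, could be prototyped roughly as follows. Nothing here exists in the project yet: `select_with_cap`, `keyword_score`, the step size, and the table cap are all hypothetical choices.

```rust
/// Raise the threshold in fixed steps until at most `max_tables` scores pass.
/// Returns the threshold finally used and the indices of the scores kept.
fn select_with_cap(
    scores: &[f32],
    start: f32,
    step: f32,
    max_tables: usize,
) -> (f32, Vec<usize>) {
    assert!(step > 0.0, "step must be positive or the loop cannot progress");
    let mut threshold = start;
    loop {
        let kept: Vec<usize> = scores
            .iter()
            .enumerate()
            .filter(|(_, s)| **s >= threshold)
            .map(|(i, _)| i)
            .collect();
        // Stop once the selection is small enough, or once the threshold
        // has reached the maximum possible cosine similarity.
        if kept.len() <= max_tables || threshold >= 1.0 {
            return (threshold, kept);
        }
        threshold += step;
    }
}

/// Keyword backup: count the question words (longer than two bytes)
/// that occur as substrings of the table's text, ignoring case.
fn keyword_score(question: &str, table_text: &str) -> usize {
    let haystack = table_text.to_lowercase();
    let question = question.to_lowercase();
    let count = question
        .split_whitespace()
        .filter(|w| w.len() > 2 && haystack.contains(*w))
        .count();
    count
}
```

The keyword score is deliberately crude (plain substring matching with no stemming or accent handling), so it is better suited as a tie-breaker or last resort than as a replacement for the embeddings.
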