refactor: reorganize project structure and fix broken references

- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.) - Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml) - Move basedosdados.duckdb to data/ directory - Update Dockerfile and start.sh with new file paths - Update README.md with correct script paths - Remove Python ask.py (replaced by Rust binary in ask/ask) - Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs) - Remove sentence-transformer dependencies from ask - Move docs and context artifacts to their directories
2026-03-29 20:46:27 +02:00
parent 02cb13362c
commit ed5fa6756e
43 changed files with 302366 additions and 1093 deletions
--- a/docs/dataset_embeds.md
+++ b/docs/dataset_embeds.md
@@ -0,0 +1,59 @@
+## Goal
+
+Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.
+
+## Instructions
+
+- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
+- Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
+- Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
+- Include column descriptions from basedosdados-schema.json in table embeddings
+- Generate word clouds from schema attributes and dataset names for docs
+
+## Discoveries
+
+- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
+- **Embeddings work**: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
+- **Threshold tuning**: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
+- **sqlcoder issues**: Returns JSON instead of SQL when using `format: "json"` - removing it helps but still generates imperfect SQL
+- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
+- **Top donation query works**: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)
+
+## Accomplished
+
+1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
+2. ✅ Created table_embeddings.json (~2MB, 765 tables)
+3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
+4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
+5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
+6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
+7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
+8. ✅ Updated README.md with new architecture and env vars
+9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/
+
+## Relevant files / directories
+
+### Created/Modified
+- `embed_tables.py` - Python script to generate table embeddings
+- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
+- `ask/src/table_selector.rs` - Table selection via embeddings
+- `ask/src/schema_filter.rs` - Schema filtering module
+- `ask/src/sql_generator.rs` - SQL generator trait + implementations
+- `ask/src/main.rs` - Integrated all components
+- `ask/Cargo.toml` - Added serde dependency
+- `README.md` - Updated with new architecture
+- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
+- `docs/wordcloud_datasets.png` - Word cloud from dataset names
+
+### Configuration (env vars)
+- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
+- `SIMILARITY_THRESHOLD` - 0.35 default
+- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
+- `EMBEDDINGS_FILE`, `SCHEMA_JSON`
+
+## Next Steps
+
+- Increase similarity threshold (try 0.45) to reduce table count
+- Improve sqlcoder prompt for better SQL generation
+- Add fallback to increase threshold if too many tables selected
+- Consider keyword matching as backup if embeddings fail