baseldosdados/docs/dataset_embeds.md at e9b680379b435992ad990976dd1486d28173265c

Files

rafapolo ed5fa6756e refactor: reorganize project structure and fix broken references

- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories

2026-03-29 20:46:27 +02:00

3.2 KiB

Raw Blame History

Goal

Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.

Instructions

Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables based on user question similarity
Use similarity threshold (default 0.35) instead of fixed top-k to dynamically select tables
Implement configurable SQL generator (sqlcoder/gemini/openrouter) via env vars
Include column descriptions from basedosdados-schema.json in table embeddings
Generate word clouds from schema attributes and dataset names for docs

Discoveries

Schema format: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
Embeddings work: Using all-MiniLM-L6-v2 (384-dim) to match questions to tables
Threshold tuning: Default 0.35 threshold works best - lower returns too many tables (190+), higher may miss relevant ones
sqlcoder issues: Returns JSON instead of SQL when using format: "json" - removing it helps but still generates imperfect SQL
Retry mechanism: Already built into main.rs - helps fix SQL errors automatically
Top donation query works: "deputados com mais doacoes" successfully returned top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

Accomplished

✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
✅ Created table_embeddings.json (~2MB, 765 tables)
✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
✅ Created schema_filter.rs - extracts filtered schema from full JSON
✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
✅ Modified main.rs - integrated table selection + configurable SQL generator
✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
✅ Updated README.md with new architecture and env vars
✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

Relevant files / directories

Created/Modified

embed_tables.py - Python script to generate table embeddings
context/table_embeddings.json - Pre-computed embeddings (765 tables)
ask/src/table_selector.rs - Table selection via embeddings
ask/src/schema_filter.rs - Schema filtering module
ask/src/sql_generator.rs - SQL generator trait + implementations
ask/src/main.rs - Integrated all components
ask/Cargo.toml - Added serde dependency
README.md - Updated with new architecture
docs/wordcloud_attributes.png - Word cloud from column names/descriptions
docs/wordcloud_datasets.png - Word cloud from dataset names

Configuration (env vars)

SQL_GENERATOR - sqlcoder|gemini|openrouter
SIMILARITY_THRESHOLD - 0.35 default
OLLAMA_MODEL - sqlcoder:7b-q4_K_M
EMBEDDINGS_FILE, SCHEMA_JSON

Next Steps

Increase similarity threshold (try 0.45) to reduce table count
Improve sqlcoder prompt for better SQL generation
Add fallback to increase threshold if too many tables selected
Consider keyword matching as backup if embeddings fail

3.2 KiB Raw Blame History