refactor: reorganize project structure and fix broken references
- Move scripts to scripts/ directory (roda.sh, prepara_db.py, etc.)
- Move shell config to shell/ directory (Caddyfile, auth.py, haloy.yml)
- Move basedosdados.duckdb to data/ directory
- Update Dockerfile and start.sh with new file paths
- Update README.md with correct script paths
- Remove Python ask.py (replaced by Rust binary in ask/ask)
- Add Rust source files (schema_filter.rs, sql_generator.rs, table_selector.rs)
- Remove sentence-transformer dependencies from ask
- Move docs and context artifacts to their directories
59
docs/dataset_embeds.md
Normal file
@@ -0,0 +1,59 @@
## Goal

Build an intelligent SQL generator for Base dos Dados that uses semantic search (sentence-transformers) to select relevant tables from the schema before generating SQL, with the option to use local models (sqlcoder via Ollama) or external APIs.

## Instructions

- Use sentence-transformers (all-MiniLM-L6-v2) to embed table metadata and select relevant tables by similarity to the user's question
- Use a similarity threshold (default 0.35) instead of a fixed top-k to select tables dynamically
- Implement a configurable SQL generator (sqlcoder/gemini/openrouter) selected via env vars
- Include column descriptions from basedosdados-schema.json in table embeddings
- Generate word clouds from schema attributes and dataset names for docs

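The threshold-based selection described above can be sketched in a few lines. This is a minimal illustration, not the code in `table_selector.rs`: the function names, the `dataset.table` key format, and the toy 3-dimensional vectors (standing in for the 384-dimensional embeddings) are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tables(question_vec, table_vecs, threshold=0.35):
    """Return (table, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine_similarity(question_vec, vec))
              for name, vec in table_vecs.items()]
    selected = [(name, score) for name, score in scored if score >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy vectors (hypothetical values); real embeddings would come from the model.
toy_tables = {
    "br_camara_dados_abertos.deputado": [0.9, 0.1, 0.0],
    "br_ibge_pib.municipio": [0.0, 0.2, 0.9],
    "br_camara_dados_abertos.despesa": [0.7, 0.6, 0.1],
}
selected = select_tables([1.0, 0.2, 0.0], toy_tables)
selected_names = [name for name, _ in selected]
```

Unlike a fixed top-k, the number of tables returned here varies with the question: an unrelated table is dropped no matter how few tables pass, and a broad question can return many.
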
## Discoveries

- **Schema format**: basedosdados-schema.json contains 765 tables with column names, types, and descriptions (~3.8MB)
- **Embeddings work**: all-MiniLM-L6-v2 (384-dim) matches questions to tables
- **Threshold tuning**: The default threshold of 0.35 works best - lower values return too many tables (190+), higher values may miss relevant ones
- **sqlcoder issues**: Returns JSON instead of SQL when the request sets `format: "json"` - removing it helps, but the generated SQL is still imperfect
- **Retry mechanism**: Already built into main.rs - helps fix SQL errors automatically
- **Top donation query works**: "deputados com mais doacoes" successfully returned the top 10 candidates with donation amounts (R$3.7M, R$3.3M, etc.)

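The retry mechanism mentioned above is not shown in this document, so the following is only a generic sketch of the idea: generate SQL, run it, and pass any error message back into the next generation attempt. `generate_with_retry`, `fake_generate`, and `fake_run` are hypothetical names, and the stubs stand in for a real model and database.

```python
def generate_with_retry(question, generate_sql, run_sql, max_attempts=3):
    """Ask for SQL, run it, and feed any error back into the next attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        sql = generate_sql(question, last_error)
        try:
            return sql, run_sql(sql), attempt
        except Exception as exc:  # a real version would catch the database's error type
            last_error = str(exc)
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {last_error}")


def fake_generate(question, previous_error):
    # Stub model: the first answer has a typo, fixed once an error is reported.
    return "SELECT 1" if previous_error else "SELEC 1"


def fake_run(sql):
    # Stub database: rejects anything that does not start with SELECT.
    if not sql.startswith("SELECT"):
        raise ValueError(f"syntax error near {sql.split()[0]!r}")
    return [(1,)]


sql, rows, attempts = generate_with_retry("toy question", fake_generate, fake_run)
```

Passing the callables in as arguments keeps the loop independent of which generator (sqlcoder, gemini, openrouter) is configured.
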
## Accomplished

1. ✅ Created embed_tables.py - generates embeddings from basedosdados-schema.json
2. ✅ Created table_embeddings.json (~2MB, 765 tables)
3. ✅ Created table_selector.rs - loads embeddings, computes cosine similarity, selects tables by threshold
4. ✅ Created schema_filter.rs - extracts filtered schema from full JSON
5. ✅ Created sql_generator.rs - trait with implementations for sqlcoder, gemini, openrouter
6. ✅ Modified main.rs - integrated table selection + configurable SQL generator
7. ✅ Fixed existing Rust compilation errors in main.rs (ratatui API changes)
8. ✅ Updated README.md with new architecture and env vars
9. ✅ Created wordcloud scripts and generated wordcloud_attributes.png, wordcloud_datasets.png in docs/

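As an illustration of the schema-filtering step (item 4), the sketch below keeps only the selected tables and renders them as compact text for a prompt. The layout of basedosdados-schema.json is not shown in this document, so the `tables`/`columns` structure, the toy table names, and both function names are assumptions for the example.

```python
def filter_schema(full_schema, selected_names):
    """Keep only the tables whose names were selected."""
    wanted = set(selected_names)
    return {"tables": [t for t in full_schema["tables"] if t["name"] in wanted]}


def schema_to_prompt(filtered):
    """Render the filtered schema as plain text, one column per line."""
    lines = []
    for table in filtered["tables"]:
        lines.append(f"TABLE {table['name']}")
        for col in table["columns"]:
            lines.append(f"  {col['name']} {col['type']} -- {col['description']}")
    return "\n".join(lines)


# Toy schema with an assumed structure; the real file may be organized differently.
full = {
    "tables": [
        {
            "name": "toy_dataset.orders",
            "columns": [
                {"name": "id", "type": "INTEGER", "description": "order identifier"},
                {"name": "total", "type": "DOUBLE", "description": "order total"},
            ],
        },
        {
            "name": "toy_dataset.customers",
            "columns": [
                {"name": "id", "type": "INTEGER", "description": "customer identifier"},
            ],
        },
    ]
}
prompt = schema_to_prompt(filter_schema(full, ["toy_dataset.orders"]))
```

Filtering before prompting matters because the full schema (~3.8MB) is far larger than what a model prompt can reasonably carry.
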
## Relevant files / directories

### Created/Modified

- `embed_tables.py` - Python script to generate table embeddings
- `context/table_embeddings.json` - Pre-computed embeddings (765 tables)
- `ask/src/table_selector.rs` - Table selection via embeddings
- `ask/src/schema_filter.rs` - Schema filtering module
- `ask/src/sql_generator.rs` - SQL generator trait + implementations
- `ask/src/main.rs` - Integrated all components
- `ask/Cargo.toml` - Added serde dependency
- `README.md` - Updated with new architecture
- `docs/wordcloud_attributes.png` - Word cloud from column names/descriptions
- `docs/wordcloud_datasets.png` - Word cloud from dataset names

### Configuration (env vars)
- `SQL_GENERATOR` - sqlcoder|gemini|openrouter
- `SIMILARITY_THRESHOLD` - 0.35 default
- `OLLAMA_MODEL` - sqlcoder:7b-q4_K_M
- `EMBEDDINGS_FILE`, `SCHEMA_JSON`

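One way to read these variables with defaults and validation is sketched below. Only the variable names and the 0.35 threshold default come from the list above; treating `sqlcoder` as the default generator and `sqlcoder:7b-q4_K_M` as the default model are assumptions, as are the `Config` and `load_config` names.

```python
import os
from dataclasses import dataclass
from typing import Optional

VALID_GENERATORS = ("sqlcoder", "gemini", "openrouter")


@dataclass(frozen=True)
class Config:
    sql_generator: str
    similarity_threshold: float
    ollama_model: str
    embeddings_file: Optional[str]
    schema_json: Optional[str]


def load_config(env=None):
    """Build a Config from a mapping of env vars (defaults to os.environ)."""
    env = os.environ if env is None else env
    generator = env.get("SQL_GENERATOR", "sqlcoder")  # assumed default
    if generator not in VALID_GENERATORS:
        raise ValueError(f"SQL_GENERATOR must be one of {VALID_GENERATORS}, got {generator!r}")
    return Config(
        sql_generator=generator,
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", "0.35")),
        ollama_model=env.get("OLLAMA_MODEL", "sqlcoder:7b-q4_K_M"),  # assumed default
        embeddings_file=env.get("EMBEDDINGS_FILE"),
        schema_json=env.get("SCHEMA_JSON"),
    )


config = load_config({"SQL_GENERATOR": "gemini", "SIMILARITY_THRESHOLD": "0.45"})
```

Accepting the environment as a plain mapping makes the loader easy to exercise without touching the real process environment.
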
## Next Steps

- Increase the similarity threshold (try 0.45) to reduce the table count
- Improve the sqlcoder prompt for better SQL generation
- Add a fallback that raises the threshold when too many tables are selected
- Consider keyword matching as a backup if embeddings fail
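The threshold fallback and keyword backup listed above are not implemented yet; the sketch below shows one possible shape for each. The cap of 20 tables, the 0.05 step, the prefix-matching rule, and both function names are hypothetical choices for illustration only.

```python
import re


def select_with_cap(scores, threshold=0.35, max_tables=20, step=0.05, ceiling=0.95):
    """Raise the threshold step by step until at most max_tables remain."""
    current = threshold
    selected = [name for name, score in scores.items() if score >= current]
    while len(selected) > max_tables and current + step <= ceiling:
        current = round(current + step, 2)
        selected = [name for name, score in scores.items() if score >= current]
    return current, sorted(selected, key=scores.get, reverse=True)


def keyword_fallback(question, table_names, min_len=4):
    """Match tables whose name parts are a prefix of some word in the question."""
    words = [w for w in re.findall(r"\w+", question.lower()) if len(w) >= min_len]
    matches = []
    for name in table_names:
        tokens = [t for t in re.split(r"[._]", name.lower()) if len(t) >= min_len]
        if any(word.startswith(token) for word in words for token in tokens):
            matches.append(name)
    return matches


# Toy similarity scores (hypothetical values).
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.42, "d": 0.36}
final_threshold, kept = select_with_cap(toy_scores, max_tables=2)

fallback = keyword_fallback(
    "deputados com mais doacoes",
    ["br_camara_dados_abertos.deputado", "br_ibge_pib.municipio"],
)
```

Prefix matching is used so that a plural such as "deputados" still matches a table named `deputado`; a production version would likely need stemming or accent handling.
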
870
docs/file_tree.md
Normal file
@@ -0,0 +1,870 @@
# S3 File Tree: baseldosdados

## br_anatel_banda_larga_fixa/ (4 tables, 206.1 MB, 119 files)

- **densidade_brasil/** (1 files, 2.9 KB, 3 cols)
- **densidade_municipio/** (2 files, 10.3 MB, 5 cols)
- **densidade_uf/** (1 files, 50.8 KB, 4 cols)
- **microdados/** (115 files, 195.8 MB, 12 cols)

## br_anatel_indice_brasileiro_conectividade/ (1 tables, 443.4 KB, 1 files)

- **municipio/** (1 files, 443.4 KB, 11 cols)

## br_anp_precos_combustiveis/ (1 tables, 79.0 MB, 69 files)

- **microdados/** (69 files, 79.0 MB, 14 cols)

## br_ans_beneficiario/ (1 tables, 8.3 GB, 3573 files)

- **informacao_consolidada/** (3573 files, 8.3 GB, 22 cols)

## br_bcb_estban/ (3 tables, 2.3 GB, 2207 files)

- **agencia/** (1524 files, 1.4 GB, 9 cols)
- **dicionario/** (1 files, 2.9 KB, 5 cols)
- **municipio/** (682 files, 894.3 MB, 10 cols)

## br_bcb_sicor/ (11 tables, 19.6 GB, 2056 files)

- **dicionario/** (1 files, 9.3 KB, 5 cols)
- **empreendimento/** (1 files, 51.6 KB, 15 cols)
- **liberacao/** (65 files, 155.7 MB, 8 cols)
- **operacao/** (127 files, 521.6 MB, 53 cols)
- **operacoes_desclassificadas/** (1 files, 160.8 KB, 8 cols)
- **recurso_publico_complemento_operacao/** (44 files, 244.7 MB, 8 cols)
- **recurso_publico_cooperado/** (1 files, 1.2 MB, 9 cols)
- **recurso_publico_gleba/** (61 files, 3.8 GB, 7 cols)
- **recurso_publico_mutuario/** (29 files, 297.1 MB, 10 cols)
- **recurso_publico_propriedade/** (15 files, 491.1 MB, 9 cols)
- **saldo/** (1711 files, 14.2 GB, 10 cols)

## br_bd_diretorios_brasil/ (23 tables, 141.2 MB, 357 files)

- **area_conhecimento/** (1 files, 21.1 KB, 8 cols)
- **cbo_1994/** (1 files, 41.3 KB, 2 cols)
- **cbo_2002/** (1 files, 74.3 KB, 11 cols)
- **cep/** (1 files, 119.0 MB, 8 cols)
- **cid_10/** (1 files, 260.7 KB, 9 cols)
- **cid_9/** (1 files, 16.8 KB, 2 cols)
- **cnae_1/** (1 files, 23.4 KB, 8 cols)
- **cnae_2/** (1 files, 57.8 KB, 14 cols)
- **curso_superior/** (1 files, 6.2 KB, 5 cols)
- **distrito_1991/** (1 files, 126.6 KB, 4 cols)
- **distrito_2000/** (1 files, 145.4 KB, 4 cols)
- **distrito_2010/** (1 files, 150.1 KB, 4 cols)
- **empresa/** (335 files, 1.5 MB, 32 cols)
- **escola/** (1 files, 12.8 MB, 19 cols)
- **etnia_indigena/** (1 files, 5.4 KB, 2 cols)
- **instituicao_ensino_superior/** (1 files, 118.3 KB, 7 cols)
- **municipio/** (1 files, 328.0 KB, 27 cols)
- **natureza_juridica/** (1 files, 16.5 KB, 3 cols)
- **regiao/** (1 files, 637.0 B, 2 cols)
- **setor_censitario_2010/** (1 files, 1.6 MB, 14 cols)
- **setor_censitario_2022/** (1 files, 4.9 MB, 22 cols)
- **subatividade_ibge/** (1 files, 7.1 KB, 2 cols)
- **uf/** (1 files, 1.6 KB, 4 cols)

## br_bd_diretorios_mundo/ (4 tables, 1.4 MB, 4 files)

- **continente/** (1 files, 1.0 KB, 3 cols)
- **nomenclatura_comum_mercosul/** (1 files, 790.3 KB, 14 cols)
- **pais/** (1 files, 19.1 KB, 13 cols)
- **sistema_harmonizado/** (1 files, 655.9 KB, 16 cols)

## br_bd_metadados/ (2 tables, 14.7 MB, 2 files)

- **bigquery_tables/** (1 files, 68.6 KB, 10 cols)
- **prefect_flow_runs/** (1 files, 14.6 MB, 16 cols)

## br_camara_dados_abertos/ (28 tables, 267.7 MB, 222 files)

- **deputado/** (1 files, 277.8 KB, 12 cols)
- **deputado_ocupacao/** (1 files, 587.2 KB, 6 cols)
- **deputado_profissao/** (1 files, 58.0 KB, 5 cols)
- **despesa/** (22 files, 125.4 MB, 25 cols)
- **evento/** (1 files, 7.2 MB, 11 cols)
- **evento_orgao/** (1 files, 322.0 KB, 3 cols)
- **evento_presenca_deputado/** (5 files, 6.0 MB, 4 cols)
- **evento_requerimento/** (1 files, 287.1 KB, 3 cols)
- **frente/** (1 files, 183.9 KB, 10 cols)
- **frente_deputado/** (1 files, 767.1 KB, 5 cols)
- **funcionario/** (1 files, 316.7 KB, 10 cols)
- **legislatura/** (1 files, 2.8 KB, 5 cols)
- **legislatura_mesa/** (1 files, 7.9 KB, 13 cols)
- **licitacao/** (1 files, 327.5 KB, 18 cols)
- **licitacao_contrato/** (1 files, 198.6 KB, 19 cols)
- **licitacao_item/** (1 files, 6.9 MB, 21 cols)
- **licitacao_pedido/** (1 files, 597.0 KB, 11 cols)
- **licitacao_proposta/** (1 files, 738.5 KB, 13 cols)
- **orgao/** (1 files, 176.6 KB, 11 cols)
- **orgao_deputado/** (1 files, 511.2 KB, 9 cols)
- **proposicao_autor/** (5 files, 8.9 MB, 8 cols)
- **proposicao_microdados/** (81 files, 72.5 MB, 25 cols)
- **proposicao_tema/** (80 files, 2.1 MB, 6 cols)
- **votacao/** (1 files, 10.4 MB, 17 cols)
- **votacao_objeto/** (3 files, 8.7 MB, 10 cols)
- **votacao_orientacao_bancada/** (1 files, 335.7 KB, 5 cols)
- **votacao_parlamentar/** (5 files, 10.8 MB, 9 cols)
- **votacao_proposicao/** (1 files, 3.1 MB, 10 cols)

## br_ce_fortaleza_sefin_iptu/ (1 tables, 1.5 MB, 1 files)

- **face_quadra/** (1 files, 1.5 MB, 13 cols)

## br_cgu_beneficios_cidadao/ (6 tables, 61.4 GB, 6751 files)

- **auxilio_brasil/** (643 files, 3.0 GB, 10 cols)
- **auxilio_emergencial/** (1426 files, 5.9 GB, 14 cols)
- **bolsa_familia_pagamento/** (1479 files, 25.8 GB, 10 cols)
- **bpc/** (1667 files, 15.9 GB, 15 cols)
- **garantia_safra/** (92 files, 443.9 MB, 7 cols)
- **novo_bolsa_familia/** (1444 files, 10.4 GB, 10 cols)

## br_cgu_cartao_pagamento/ (4 tables, 40.2 MB, 30 files)

- **dicionario/** (1 files, 2.9 KB, 5 cols)
- **microdados_compras_centralizadas/** (12 files, 7.5 MB, 14 cols)
- **microdados_defesa_civil/** (1 files, 1.3 MB, 20 cols)
- **microdados_governo_federal/** (16 files, 31.5 MB, 15 cols)

## br_cgu_dados_abertos/ (3 tables, 6.9 MB, 3 files)

- **conjunto/** (1 files, 1.4 MB, 13 cols)
- **organizacao/** (1 files, 119.7 KB, 9 cols)
- **recurso/** (1 files, 5.3 MB, 11 cols)

## br_cgu_emendas_parlamentares/ (1 tables, 2.2 MB, 1 files)

- **microdados/** (1 files, 2.2 MB, 25 cols)

## br_cgu_licitacao_contrato/ (8 tables, 2.6 GB, 553 files)

- **contrato_apostilamento/** (1 files, 532.2 KB, 14 cols)
- **contrato_compra/** (1 files, 35.5 MB, 26 cols)
- **contrato_item/** (15 files, 65.9 MB, 12 cols)
- **contrato_termo_aditivo/** (1 files, 17.2 MB, 12 cols)
- **licitacao/** (22 files, 104.1 MB, 19 cols)
- **licitacao_empenho/** (39 files, 418.5 MB, 12 cols)
- **licitacao_item/** (66 files, 392.4 MB, 16 cols)
- **licitacao_participante/** (408 files, 1.6 GB, 15 cols)

## br_cgu_orcamento_publico/ (1 tables, 8.7 MB, 1 files)

- **orcamento/** (1 files, 8.7 MB, 26 cols)

## br_cgu_receitas_publicas/ (1 tables, 15.0 MB, 16 files)

- **receitas/** (16 files, 15.0 MB, 16 cols)

## br_cgu_servidores_executivo_federal/ (7 tables, 14.8 GB, 2993 files)

- **afastamentos/** (27 files, 72.7 MB, 8 cols)
- **cadastro_aposentados/** (133 files, 781.5 MB, 30 cols)
- **cadastro_pensionistas/** (167 files, 1.1 GB, 34 cols)
- **cadastro_reserva_reforma_militares/** (51 files, 223.1 MB, 29 cols)
- **cadastro_servidores/** (1413 files, 6.1 GB, 46 cols)
- **observacoes/** (73 files, 421.9 MB, 7 cols)
- **remuneracao/** (1129 files, 6.0 GB, 40 cols)

## br_cnj_improbidade_administrativa/ (1 tables, 2.6 MB, 1 files)

- **condenacao/** (1 files, 2.6 MB, 63 cols)

## br_cnpq_bolsas/ (2 tables, 86.6 MB, 21 files)

- **dicionario/** (1 files, 3.6 KB, 5 cols)
- **microdados/** (20 files, 86.6 MB, 31 cols)

## br_cvm_administradores_carteira/ (3 tables, 416.5 KB, 7 files)

- **pessoa_fisica/** (3 files, 120.3 KB, 7 cols)
- **pessoa_juridica/** (3 files, 238.6 KB, 24 cols)
- **responsavel/** (1 files, 57.6 KB, 3 cols)

## br_cvm_oferta_publica_distribuicao/ (1 tables, 1.2 MB, 1 files)

- **dia/** (1 files, 1.2 MB, 44 cols)

## br_datahackers_state_data/ (1 tables, 445.7 KB, 1 files)

- **microdados/** (1 files, 445.7 KB, 353 cols)

## br_fbsp_absp/ (2 tables, 41.3 KB, 2 files)

- **uf/** (1 files, 32.6 KB, 29 cols)
- **violencia_escola/** (1 files, 8.8 KB, 5 cols)

## br_fgv_igp/ (7 tables, 111.1 KB, 7 files)

- **igp_10_mes/** (1 files, 17.4 KB, 7 cols)
- **igp_di_ano/** (1 files, 4.2 KB, 5 cols)
- **igp_di_mes/** (1 files, 38.7 KB, 7 cols)
- **igp_m_ano/** (1 files, 2.9 KB, 5 cols)
- **igp_m_mes/** (1 files, 22.8 KB, 9 cols)
- **igp_og_ano/** (1 files, 3.3 KB, 5 cols)
- **igp_og_mes/** (1 files, 21.8 KB, 7 cols)

## br_geobr_mapas/ (25 tables, 245.7 MB, 26 files)

- **amazonia_legal/** (1 files, 212.8 KB, 1 cols)
- **area_minima_comparavel_2010/** (1 files, 12.5 MB, 3 cols)
- **area_risco_desastre/** (1 files, 1.7 MB, 8 cols)
- **arranjo_populacional/** (1 files, 2.1 MB, 8 cols)
- **bioma/** (2 files, 15.4 MB, 4 cols)
- **concentracao_urbana/** (1 files, 1.5 MB, 8 cols)
- **escola/** (1 files, 3.1 MB, 3 cols)
- **estabelecimentos_saude/** (1 files, 4.2 MB, 5 cols)
- **limite_vizinhanca/** (1 files, 5.1 MB, 12 cols)
- **mesorregiao/** (1 files, 3.4 MB, 4 cols)
- **microrregiao/** (1 files, 6.7 MB, 4 cols)
- **municipio/** (1 files, 17.2 MB, 3 cols)
- **pais/** (1 files, 455.7 KB, 1 cols)
- **pegada_urbana/** (1 files, 6.3 MB, 6 cols)
- **regiao/** (1 files, 884.8 KB, 3 cols)
- **regiao_imediata/** (1 files, 6.1 MB, 4 cols)
- **regiao_intermediaria/** (1 files, 3.9 MB, 4 cols)
- **regiao_metropolitana_2017/** (1 files, 2.8 MB, 8 cols)
- **saude/** (1 files, 2.1 MB, 4 cols)
- **sede_municipal/** (1 files, 384.0 KB, 8 cols)
- **semiarido/** (1 files, 2.0 MB, 3 cols)
- **setor_censitario_2010/** (1 files, 139.3 MB, 13 cols)
- **terra_indigena/** (1 files, 3.1 MB, 15 cols)
- **uf/** (1 files, 1.4 MB, 3 cols)
- **unidade_conservacao/** (1 files, 3.8 MB, 14 cols)

## br_ibge_censo_2022/ (16 tables, 4.1 GB, 512 files)

- **alfabetizacao_grupo_idade_sexo_raca/** (1 files, 1.7 MB, 6 cols)
- **cadastro_enderecos/** (425 files, 2.8 GB, 35 cols)
- **caracteristica_domicilio_grupo_idade_raca_destino_lixo/** (11 files, 9.9 MB, 6 cols)
- **caracteristica_domicilio_grupo_idade_raca_esgotamento_sanitario/** (14 files, 9.4 MB, 6 cols)
- **caracteristica_domicilio_grupo_idade_raca_ligacao_abastecimento_agua/** (20 files, 22.7 MB, 6 cols)
- **caracteristica_domicilio_grupo_idade_raca_tipo_domicilio/** (9 files, 5.4 MB, 6 cols)
- **dicionario/** (1 files, 2.2 KB, 5 cols)
- **domicilio_recenseado/** (1 files, 218.4 KB, 3 cols)
- **indice_envelhecimento_raca/** (1 files, 342.8 KB, 6 cols)
- **municipio/** (1 files, 177.5 KB, 13 cols)
- **populacao_grupo_idade_sexo_raca/** (5 files, 5.4 MB, 6 cols)
- **populacao_grupo_idade_uf/** (1 files, 2.9 KB, 3 cols)
- **populacao_idade_sexo/** (7 files, 5.7 MB, 7 cols)
- **setor_censitario/** (13 files, 1.3 GB, 1423 cols)
- **terra_indigena/** (1 files, 14.7 KB, 6 cols)
- **territorio_quilombola/** (1 files, 11.8 KB, 6 cols)

## br_ibge_censo_demografico/ (33 tables, 4.2 GB, 1058 files)

- **dicionario/** (1 files, 31.0 KB, 5 cols)
- **microdados_domicilio_1970/** (17 files, 37.7 MB, 26 cols)
- **microdados_domicilio_1980/** (17 files, 27.9 MB, 26 cols)
- **microdados_domicilio_1991/** (17 files, 97.5 MB, 43 cols)
- **microdados_domicilio_2000/** (25 files, 130.2 MB, 56 cols)
- **microdados_domicilio_2010/** (50 files, 176.0 MB, 76 cols)
- **microdados_pessoa_1970/** (100 files, 308.6 MB, 41 cols)
- **microdados_pessoa_1980/** (150 files, 380.6 MB, 64 cols)
- **microdados_pessoa_1991/** (150 files, 636.0 MB, 100 cols)
- **microdados_pessoa_2000/** (150 files, 838.2 MB, 110 cols)
- **microdados_pessoa_2010/** (350 files, 1.0 GB, 244 cols)
- **setor_censitario_alfabetizacao_homens_mulheres_2010/** (1 files, 26.9 MB, 172 cols)
- **setor_censitario_alfabetizacao_total_2010/** (1 files, 16.6 MB, 87 cols)
- **setor_censitario_basico_2010/** (1 files, 16.7 MB, 14 cols)
- **setor_censitario_domicilio_caracteristicas_gerais_2010/** (2 files, 32.4 MB, 243 cols)
- **setor_censitario_domicilio_moradores_2010/** (1 files, 32.8 MB, 134 cols)
- **setor_censitario_domicilio_renda_2010/** (1 files, 5.9 MB, 16 cols)
- **setor_censitario_entorno_2010/** (5 files, 175.9 MB, 1064 cols)
- **setor_censitario_idade_homens_2010/** (1 files, 17.4 MB, 136 cols)
- **setor_censitario_idade_mulheres_2010/** (1 files, 17.8 MB, 136 cols)
- **setor_censitario_idade_total_2010/** (1 files, 21.7 MB, 136 cols)
- **setor_censitario_pessoa_renda_2010/** (1 files, 55.6 MB, 134 cols)
- **setor_censitario_raca_alfabetizacao_idade_genero_2010/** (1 files, 18.8 MB, 157 cols)
- **setor_censitario_raca_idade_0_4_genero_2010/** (1 files, 2.5 MB, 12 cols)
- **setor_censitario_raca_idade_genero_2010/** (2 files, 33.1 MB, 253 cols)
- **setor_censitario_registro_civil_2010/** (1 files, 1.4 MB, 5 cols)
- **setor_censitario_relacao_parentesco_conjuges_2010/** (1 files, 18.5 MB, 215 cols)
- **setor_censitario_relacao_parentesco_filhos_2010/** (1 files, 18.7 MB, 206 cols)
- **setor_censitario_relacao_parentesco_filhos_enteados_2010/** (2 files, 15.9 MB, 256 cols)
- **setor_censitario_relacao_parentesco_outros_2010/** (2 files, 12.4 MB, 242 cols)
- **setor_censitario_responsavel_domicilios_homens_total_2010/** (2 files, 24.7 MB, 218 cols)
- **setor_censitario_responsavel_domicilios_mulheres_2010/** (1 files, 9.9 MB, 110 cols)
- **setor_censitario_responsavel_renda_2010/** (1 files, 48.2 MB, 134 cols)

## br_ibge_estadic/ (1 tables, 4.5 KB, 1 files)

- **dicionario/** (1 files, 4.5 KB, 5 cols)

## br_ibge_inpc/ (4 tables, 3.6 MB, 4 files)

- **mes_brasil/** (1 files, 17.8 KB, 8 cols)
- **mes_categoria_brasil/** (1 files, 283.8 KB, 8 cols)
- **mes_categoria_municipio/** (1 files, 1.5 MB, 10 cols)
- **mes_categoria_rm/** (1 files, 1.8 MB, 10 cols)

## br_ibge_ipca/ (4 tables, 3.6 MB, 4 files)

- **mes_brasil/** (1 files, 17.3 KB, 8 cols)
- **mes_categoria_brasil/** (1 files, 286.3 KB, 8 cols)
- **mes_categoria_municipio/** (1 files, 1.5 MB, 10 cols)
- **mes_categoria_rm/** (1 files, 1.8 MB, 10 cols)

## br_ibge_ipca15/ (4 tables, 2.3 MB, 4 files)

- **mes_brasil/** (1 files, 9.4 KB, 8 cols)
- **mes_categoria_brasil/** (1 files, 279.2 KB, 8 cols)
- **mes_categoria_municipio/** (1 files, 411.5 KB, 10 cols)
- **mes_categoria_rm/** (1 files, 1.6 MB, 10 cols)

## br_ibge_pam/ (2 tables, 67.3 MB, 148 files)

- **lavoura_permanente/** (75 files, 32.9 MB, 9 cols)
- **lavoura_temporaria/** (73 files, 34.4 MB, 9 cols)

## br_ibge_pevs/ (2 tables, 3.3 MB, 74 files)

- **producao_extracao_vegetal/** (37 files, 2.5 MB, 7 cols)
- **producao_silvicultura/** (37 files, 805.8 KB, 9 cols)

## br_ibge_pib/ (2 tables, 3.6 MB, 2 files)

- **gini/** (1 files, 24.8 KB, 7 cols)
- **municipio/** (1 files, 3.6 MB, 9 cols)

## br_ibge_pnad/ (3 tables, 116.5 MB, 57 files)

- **dicionario/** (1 files, 3.3 KB, 5 cols)
- **microdados_compatibilizados_domicilio/** (17 files, 19.0 MB, 39 cols)
- **microdados_compatibilizados_pessoa/** (39 files, 97.5 MB, 70 cols)

## br_ibge_pnad_covid/ (1 tables, 7.4 KB, 1 files)

- **dicionario/** (1 files, 7.4 KB, 5 cols)

## br_ibge_pnadc/ (4 tables, 27.4 GB, 1377 files)

- **dicionario/** (1 files, 22.4 KB, 5 cols)
- **educacao/** (39 files, 1.1 GB, 279 cols)
- **microdados/** (1287 files, 25.6 GB, 424 cols)
- **rendimentos_outras_fontes/** (50 files, 707.1 MB, 293 cols)

## br_ibge_pof/ (1 tables, 38.7 KB, 1 files)

- **dicionario/** (1 files, 38.7 KB, 5 cols)

## br_ibge_populacao/ (3 tables, 646.5 KB, 3 files)

- **brasil/** (1 files, 1.1 KB, 2 cols)
- **municipio/** (1 files, 639.4 KB, 4 cols)
- **uf/** (1 files, 6.1 KB, 3 cols)

## br_ibge_ppm/ (4 tables, 12.7 MB, 160 files)

- **efetivo_rebanhos/** (52 files, 5.8 MB, 5 cols)
- **producao_aquicultura/** (10 files, 578.7 KB, 6 cols)
- **producao_origem_animal/** (49 files, 4.5 MB, 7 cols)
- **producao_pecuaria/** (49 files, 1.9 MB, 5 cols)

## br_inep_ana/ (1 tables, 4.1 KB, 1 files)

- **dicionario/** (1 files, 4.1 KB, 5 cols)

## br_inep_avaliacao_alfabetizacao/ (7 tables, 57.2 MB, 17 files)

- **alunos/** (11 files, 56.6 MB, 12 cols)
- **dicionario/** (1 files, 1.8 KB, 5 cols)
- **meta_alfabetizacao_brasil/** (1 files, 4.0 KB, 11 cols)
- **meta_alfabetizacao_municipio/** (1 files, 159.7 KB, 13 cols)
- **meta_alfabetizacao_uf/** (1 files, 6.4 KB, 12 cols)
- **municipio/** (1 files, 411.7 KB, 15 cols)
- **uf/** (1 files, 10.6 KB, 15 cols)

## br_inep_censo_educacao_superior/ (3 tables, 190.4 MB, 129 files)

- **curso/** (111 files, 186.0 MB, 193 cols)
- **dicionario/** (1 files, 2.1 KB, 5 cols)
- **ies/** (17 files, 4.4 MB, 71 cols)

## br_inep_censo_escolar/ (3 tables, 684.8 MB, 340 files)

- **dicionario/** (1 files, 5.1 KB, 5 cols)
- **escola/** (112 files, 243.3 MB, 455 cols)
- **turma/** (227 files, 441.5 MB, 76 cols)

## br_inep_educacao_especial/ (15 tables, 30.7 MB, 145 files)

- **brasil_distorcao_idade_serie/** (1 files, 1.4 KB, 3 cols)
- **brasil_taxa_rendimento/** (1 files, 2.5 KB, 5 cols)
- **distorcao_idade_serie/** (1 files, 5.6 KB, 4 cols)
- **docente_aee/** (1 files, 105.5 KB, 7 cols)
- **docente_formacao/** (1 files, 247.6 KB, 5 cols)
- **etapa_ensino/** (24 files, 6.6 MB, 6 cols)
- **faixa_etaria/** (21 files, 3.5 MB, 6 cols)
- **localizacao/** (22 files, 4.1 MB, 7 cols)
- **matricula_aee/** (1 files, 4.8 KB, 5 cols)
- **sexo_raca_cor/** (23 files, 6.1 MB, 7 cols)
- **taxa_rendimento/** (1 files, 11.0 KB, 6 cols)
- **tempo_ensino/** (22 files, 4.1 MB, 7 cols)
- **tipo_deficiencia/** (24 files, 6.0 MB, 6 cols)
- **uf_distorcao_idade_serie/** (1 files, 4.8 KB, 4 cols)
- **uf_taxa_rendimento/** (1 files, 8.9 KB, 6 cols)

## br_inep_enem/ (28 tables, 7.6 GB, 1631 files)

- **dicionario/** (1 files, 50.9 KB, 5 cols)
- **microdados/** (845 files, 5.8 GB, 63 cols)
- **questionario_socioeconomico_1998/** (1 files, 3.8 MB, 138 cols)
- **questionario_socioeconomico_1999/** (1 files, 7.5 MB, 130 cols)
- **questionario_socioeconomico_2000/** (1 files, 8.3 MB, 128 cols)
- **questionario_socioeconomico_2001/** (50 files, 74.3 MB, 243 cols)
- **questionario_socioeconomico_2002/** (22 files, 68.8 MB, 220 cols)
- **questionario_socioeconomico_2003/** (19 files, 55.7 MB, 189 cols)
- **questionario_socioeconomico_2004/** (17 files, 45.0 MB, 206 cols)
- **questionario_socioeconomico_2005/** (50 files, 95.8 MB, 224 cols)
- **questionario_socioeconomico_2006/** (50 files, 122.5 MB, 224 cols)
- **questionario_socioeconomico_2007/** (50 files, 135.1 MB, 224 cols)
- **questionario_socioeconomico_2008/** (50 files, 125.6 MB, 224 cols)
- **questionario_socioeconomico_2009/** (100 files, 145.6 MB, 294 cols)
- **questionario_socioeconomico_2010/** (17 files, 44.6 MB, 58 cols)
- **questionario_socioeconomico_2011/** (21 files, 65.8 MB, 76 cols)
- **questionario_socioeconomico_2012/** (20 files, 66.1 MB, 63 cols)
- **questionario_socioeconomico_2013/** (50 files, 98.8 MB, 77 cols)
- **questionario_socioeconomico_2014/** (50 files, 115.2 MB, 77 cols)
- **questionario_socioeconomico_2015/** (50 files, 103.8 MB, 51 cols)
- **questionario_socioeconomico_2016/** (50 files, 114.9 MB, 51 cols)
- **questionario_socioeconomico_2017/** (17 files, 58.3 MB, 28 cols)
- **questionario_socioeconomico_2018/** (17 files, 49.8 MB, 28 cols)
- **questionario_socioeconomico_2019/** (17 files, 43.3 MB, 26 cols)
- **questionario_socioeconomico_2020/** (17 files, 47.8 MB, 26 cols)
- **questionario_socioeconomico_2021/** (15 files, 29.2 MB, 26 cols)
- **questionario_socioeconomico_2022/** (16 files, 30.2 MB, 26 cols)
- **questionario_socioeconomico_2023/** (17 files, 34.6 MB, 26 cols)

## br_inep_formacao_docente/ (1 tables, 1.9 KB, 1 files)

- **dicionario/** (1 files, 1.9 KB, 5 cols)

## br_inep_ideb/ (5 tables, 27.0 MB, 9 files)

- **brasil/** (1 files, 8.6 KB, 11 cols)
- **escola/** (5 files, 21.3 MB, 14 cols)
- **municipio/** (1 files, 5.5 MB, 13 cols)
- **regiao/** (1 files, 21.0 KB, 12 cols)
- **uf/** (1 files, 87.7 KB, 12 cols)

## br_inep_indicador_nivel_socioeconomico/ (2 tables, 5.8 MB, 2 files)

- **dicionario/** (1 files, 3.5 KB, 5 cols)
- **escola/** (1 files, 5.8 MB, 18 cols)

## br_inep_indicadores_educacionais/ (11 tables, 395.6 MB, 107 files)

- **brasil/** (19 files, 1.5 MB, 214 cols)
- **brasil_remuneracao_docentes/** (1 files, 11.7 KB, 12 cols)
- **brasil_taxa_transicao/** (1 files, 42.5 KB, 67 cols)
- **escola/** (34 files, 218.8 MB, 208 cols)
- **municipio/** (32 files, 152.6 MB, 215 cols)
- **municipio_taxa_transicao/** (15 files, 19.9 MB, 68 cols)
- **regiao/** (1 files, 526.3 KB, 215 cols)
- **regiao_taxa_transicao/** (1 files, 80.2 KB, 68 cols)
- **uf/** (1 files, 1.8 MB, 215 cols)
- **uf_remuneracao_docentes/** (1 files, 106.9 KB, 13 cols)
- **uf_taxa_transicao/** (1 files, 172.7 KB, 68 cols)

## br_inep_saeb/ (11 tables, 7.9 GB, 1025 files)

- **aluno_ef_2ano/** (3 files, 8.7 MB, 38 cols)
- **aluno_ef_5ano/** (321 files, 2.4 GB, 243 cols)
- **aluno_ef_9ano/** (386 files, 2.6 GB, 267 cols)
- **aluno_em_34ano/** (50 files, 241.0 MB, 105 cols)
- **brasil/** (1 files, 59.0 KB, 17 cols)
- **brasil_taxa_alfabetizacao/** (1 files, 2.1 KB, 5 cols)
- **dicionario/** (1 files, 19.2 KB, 5 cols)
- **municipio/** (11 files, 32.1 MB, 19 cols)
- **proficiencia/** (249 files, 2.6 GB, 21 cols)
- **uf/** (1 files, 1.1 MB, 18 cols)
- **uf_taxa_alfabetizacao/** (1 files, 7.4 KB, 6 cols)

## br_inep_sinopse_estatistica_educacao_basica/ (18 tables, 256.2 MB, 559 files)

- **dicionario/** (1 files, 2.3 KB, 5 cols)
- **docente_deficiencia/** (16 files, 2.8 MB, 6 cols)
- **docente_escolaridade/** (36 files, 19.6 MB, 6 cols)
- **docente_etapa_ensino/** (52 files, 34.4 MB, 7 cols)
- **docente_faixa_etaria_sexo/** (54 files, 36.6 MB, 7 cols)
- **docente_localizacao/** (61 files, 43.0 MB, 7 cols)
- **docente_regime_contrato/** (38 files, 21.5 MB, 7 cols)
- **educacao_especial_etapa_ensino/** (23 files, 2.8 MB, 6 cols)
- **educacao_especial_faixa_etaria/** (19 files, 1.7 MB, 6 cols)
- **educacao_especial_localizacao/** (20 files, 3.8 MB, 7 cols)
- **educacao_especial_sexo_raca_cor/** (22 files, 5.5 MB, 7 cols)
- **educacao_especial_tempo_ensino/** (20 files, 3.8 MB, 7 cols)
- **educacao_especial_tipo_deficiencia/** (22 files, 3.0 MB, 6 cols)
- **etapa_ensino_serie/** (38 files, 23.1 MB, 7 cols)
- **faixa_etaria/** (27 files, 7.0 MB, 6 cols)
- **localizacao/** (35 files, 17.9 MB, 7 cols)
- **sexo_raca_cor/** (43 files, 15.8 MB, 7 cols)
- **tempo_ensino/** (32 files, 14.0 MB, 7 cols)

## br_inmet_bdmep/ (1 tables, 1.3 GB, 210 files)

- **microdados/** (210 files, 1.3 GB, 22 cols)

## br_inpe_prodes/ (1 tables, 862.4 KB, 1 files)

- **municipio_bioma/** (1 files, 862.4 KB, 8 cols)

## br_inpe_queimadas/ (1 tables, 268.1 MB, 65 files)

- **microdados/** (65 files, 268.1 MB, 13 cols)

## br_inpe_sisam/ (1 tables, 1.5 GB, 417 files)

- **microdados/** (417 files, 1.5 GB, 14 cols)

## br_ipea_avs/ (1 tables, 35.5 MB, 1 files)

- **municipio/** (1 files, 35.5 MB, 92 cols)

## br_mdr_snis/ (2 tables, 63.0 MB, 2 files)

- **municipio_agua_esgoto/** (1 files, 31.3 MB, 133 cols)
- **prestador_agua_esgoto/** (1 files, 31.7 MB, 144 cols)

## br_me_caged/ (4 tables, 1.6 GB, 705 files)

- **dicionario/** (1 files, 38.0 KB, 5 cols)
- **microdados_movimentacao/** (689 files, 1.5 GB, 25 cols)
- **microdados_movimentacao_excluida/** (1 files, 5.2 MB, 30 cols)
- **microdados_movimentacao_fora_prazo/** (14 files, 71.1 MB, 27 cols)

## br_me_cno/ (1 tables, 1.9 KB, 1 files)

- **dicionario/** (1 files, 1.9 KB, 5 cols)

## br_me_cnpj/ (5 tables, 194.1 GB, 8473 files)

- **dicionario/** (1 files, 8.2 KB, 5 cols)
- **empresas/** (3070 files, 45.0 GB, 10 cols)
- **estabelecimentos/** (3353 files, 128.6 GB, 35 cols)
- **simples/** (100 files, 283.9 MB, 7 cols)
- **socios/** (1949 files, 20.2 GB, 14 cols)

## br_me_comex_stat/ (5 tables, 1.1 GB, 445 files)

- **dicionario/** (1 files, 11.4 KB, 5 cols)
- **municipio_exportacao/** (83 files, 154.6 MB, 9 cols)
- **municipio_importacao/** (113 files, 217.7 MB, 9 cols)
- **ncm_exportacao/** (106 files, 264.4 MB, 12 cols)
- **ncm_importacao/** (142 files, 501.1 MB, 14 cols)

## br_me_rais/ (3 tables, 51.9 GB, 3541 files)

- **dicionario/** (1 files, 54.5 KB, 5 cols)
- **microdados_estabelecimentos/** (566 files, 803.2 MB, 26 cols)
- **microdados_vinculos/** (2974 files, 51.1 GB, 66 cols)

## br_me_sic/ (2 tables, 200.4 KB, 2 files)

- **dicionario/** (1 files, 11.7 KB, 5 cols)
- **transferencia/** (1 files, 188.7 KB, 17 cols)

## br_me_siconfi/ (7 tables, 441.6 MB, 281 files)
|
||||
|
||||
- **municipio_balanco_patrimonial/** (27 files, 51.4 MB, 8 cols)
|
||||
- **municipio_despesas_funcao/** (60 files, 127.7 MB, 10 cols)
|
||||
- **municipio_despesas_orcamentarias/** (89 files, 143.9 MB, 10 cols)
|
||||
- **municipio_receitas_orcamentarias/** (72 files, 115.4 MB, 10 cols)
|
||||
- **uf_despesas_funcao/** (11 files, 1.4 MB, 10 cols)
|
||||
- **uf_despesas_orcamentarias/** (11 files, 1.0 MB, 10 cols)
|
||||
- **uf_receitas_orcamentarias/** (11 files, 792.3 KB, 10 cols)
|
||||
|
||||
## br_mec_prouni/ (1 tables, 1.7 KB, 1 files)
|
||||
|
||||
- **dicionario/** (1 files, 1.7 KB, 5 cols)
|
||||
|
||||
## br_mec_sisu/ (1 tables, 1.4 GB, 314 files)
|
||||
|
||||
- **microdados/** (314 files, 1.4 GB, 52 cols)
|
||||
|
||||
## br_mg_belohorizonte_smfa_iptu/ (2 tables, 2.4 GB, 221 files)
|
||||
|
||||
- **dicionario/** (1 files, 1.7 KB, 5 cols)
|
||||
- **iptu/** (220 files, 2.4 GB, 26 cols)
|
||||
|
||||
## br_mme_consumo_energia_eletrica/ (1 tables, 370.2 KB, 1 files)
|
||||
|
||||
- **uf/** (1 files, 370.2 KB, 6 cols)
|
||||
|
||||
## br_mp_pep/ (1 tables, 7.4 MB, 7 files)
|
||||
|
||||
- **cargos_funcoes/** (7 files, 7.4 MB, 16 cols)
|
||||
|
||||
## br_ms_cnes/ (14 tables, 24.4 GB, 1424 files)
|
||||
|
||||
- **dados_complementares/** (52 files, 18.5 MB, 94 cols)
|
||||
- **dicionario/** (1 files, 25.1 KB, 5 cols)
|
||||
- **equipamento/** (52 files, 863.1 MB, 11 cols)
|
||||
- **equipe/** (41 files, 105.0 MB, 24 cols)
|
||||
- **estabelecimento/** (214 files, 1.7 GB, 204 cols)
|
||||
- **estabelecimento_ensino/** (13 files, 135.6 KB, 14 cols)
|
||||
- **estabelecimento_filantropico/** (19 files, 320.4 KB, 14 cols)
|
||||
- **gestao_metas/** (19 files, 746.3 KB, 15 cols)
|
||||
- **habilitacao/** (41 files, 15.6 MB, 16 cols)
|
||||
- **incentivos/** (24 files, 3.0 MB, 15 cols)
|
||||
- **leito/** (41 files, 15.7 MB, 10 cols)
|
||||
- **profissional/** (820 files, 21.0 GB, 23 cols)
|
||||
- **regra_contratual/** (17 files, 4.3 MB, 15 cols)
|
||||
- **servico_especializado/** (70 files, 788.0 MB, 15 cols)
|
||||
|
||||
## br_ms_pns/ (3 tables, 51.9 MB, 8 files)
|
||||
|
||||
- **dicionario/** (1 files, 36.3 KB, 5 cols)
|
||||
- **microdados_2013/** (3 files, 20.6 MB, 1000 cols)
|
||||
- **microdados_2019/** (4 files, 31.2 MB, 1087 cols)
|
||||
|
||||
## br_ms_populacao/ (1 tables, 16.3 MB, 9 files)
|
||||
|
||||
- **municipio/** (9 files, 16.3 MB, 5 cols)
|
||||
|
||||
## br_ms_sia/ (3 tables, 46.2 GB, 7629 files)
|
||||
|
||||
- **dicionario/** (1 files, 129.3 KB, 5 cols)
|
||||
- **producao_ambulatorial/** (7159 files, 45.3 GB, 59 cols)
|
||||
- **psicossocial/** (469 files, 962.4 MB, 41 cols)
|
||||
|
||||
## br_ms_sih/ (3 tables, 31.6 GB, 5824 files)
|
||||
|
||||
- **aihs_reduzidas/** (1794 files, 7.6 GB, 109 cols)
|
||||
- **dicionario/** (1 files, 206.7 KB, 5 cols)
|
||||
- **servicos_profissionais/** (4029 files, 23.9 GB, 37 cols)
|
||||
|
||||
## br_ms_sim/ (2 tables, 872.2 MB, 138 files)
|
||||
|
||||
- **dicionario/** (1 files, 6.6 KB, 5 cols)
|
||||
- **microdados/** (137 files, 872.2 MB, 92 cols)
|
||||
|
||||
## br_ms_sinan/ (3 tables, 616.2 MB, 215 files)
|
||||
|
||||
- **dicionario/** (1 files, 7.2 KB, 5 cols)
|
||||
- **microdados_dengue/** (179 files, 503.7 MB, 151 cols)
|
||||
- **microdados_influenza_srag/** (35 files, 112.5 MB, 205 cols)
|
||||
|
||||
## br_ms_sinasc/ (2 tables, 1.4 GB, 352 files)
|
||||
|
||||
- **dicionario/** (1 files, 6.4 KB, 5 cols)
|
||||
- **microdados/** (351 files, 1.4 GB, 66 cols)
|
||||
|
||||
## br_ms_sisvan/ (2 tables, 19.2 GB, 1540 files)
|
||||
|
||||
- **dicionario/** (1 files, 2.3 KB, 5 cols)
|
||||
- **microdados/** (1539 files, 19.2 GB, 28 cols)
|
||||
|
||||
## br_ms_vacinacao_covid19/ (1 tables, 3.8 KB, 1 files)
|
||||
|
||||
- **dicionario/** (1 files, 3.8 KB, 5 cols)
|
||||
|
||||
## br_poder360_pesquisas/ (1 tables, 1.3 MB, 1 files)
|
||||
|
||||
- **microdados/** (1 files, 1.3 MB, 24 cols)
|
||||
|
||||
## br_rf_arrecadacao/ (5 tables, 5.9 MB, 57 files)
|
||||
|
||||
- **cnae/** (9 files, 352.9 KB, 20 cols)
|
||||
- **ir_ipi/** (6 files, 41.0 KB, 10 cols)
|
||||
- **itr/** (8 files, 3.3 MB, 5 cols)
|
||||
- **natureza_juridica/** (9 files, 577.9 KB, 20 cols)
|
||||
- **uf/** (25 files, 1.7 MB, 45 cols)
|
||||
|
||||
## br_rf_cafir/ (2 tables, 3.6 GB, 450 files)
|
||||
|
||||
- **dicionario/** (1 files, 2.1 KB, 5 cols)
|
||||
- **imoveis_rurais/** (449 files, 3.6 GB, 14 cols)
|
||||
|
||||
## br_rf_cno/ (5 tables, 32.5 GB, 2110 files)
|
||||
|
||||
- **areas/** (1236 files, 4.6 GB, 8 cols)
|
||||
- **cnaes/** (489 files, 2.3 GB, 4 cols)
|
||||
- **dicionario/** (1 files, 2.1 KB, 5 cols)
|
||||
- **microdados/** (193 files, 25.1 GB, 25 cols)
|
||||
- **vinculos/** (191 files, 497.9 MB, 7 cols)
|
||||
|
||||
## br_rj_isp_estatisticas_seguranca/ (14 tables, 2.8 MB, 14 files)
|
||||
|
||||
- **armas_apreendidas_mensal/** (1 files, 229.4 KB, 42 cols)
|
||||
- **armas_fogo_apreendidas_mensal/** (1 files, 11.1 KB, 7 cols)
|
||||
- **evolucao_mensal_cisp/** (1 files, 1005.0 KB, 61 cols)
|
||||
- **evolucao_mensal_municipio/** (1 files, 361.0 KB, 58 cols)
|
||||
- **evolucao_mensal_uf/** (1 files, 60.2 KB, 56 cols)
|
||||
- **evolucao_mensal_upp/** (1 files, 89.2 KB, 38 cols)
|
||||
- **evolucao_policial_morto_servico_mensal/** (1 files, 50.4 KB, 5 cols)
|
||||
- **feminicidio_mensal_cisp/** (1 files, 17.4 KB, 9 cols)
|
||||
- **relacao_cisp_aisp_risp/** (1 files, 7.2 KB, 6 cols)
|
||||
- **taxa_evolucao_anual_municipio/** (1 files, 75.7 KB, 56 cols)
|
||||
- **taxa_evolucao_anual_uf/** (1 files, 29.5 KB, 55 cols)
|
||||
- **taxa_evolucao_mensal_municipio/** (1 files, 852.5 KB, 58 cols)
|
||||
- **taxa_evolucao_mensal_uf/** (1 files, 59.9 KB, 56 cols)
|
||||
- **taxa_letalidade/** (1 files, 6.6 KB, 6 cols)
|
||||
|
||||
## br_seeg_emissoes/ (3 tables, 2.5 GB, 557 files)
|
||||
|
||||
- **dicionario/** (1 files, 13.7 KB, 5 cols)
|
||||
- **municipio/** (457 files, 2.5 GB, 17 cols)
|
||||
- **uf/** (99 files, 87.7 MB, 13 cols)
|
||||
|
||||
## br_sfb_sicar/ (2 tables, 28.3 GB, 947 files)
|
||||
|
||||
- **area_imovel/** (946 files, 28.3 GB, 11 cols)
|
||||
- **dicionario/** (1 files, 1.7 KB, 5 cols)
|
||||
|
||||
## br_simet_educacao_conectada/ (1 tables, 10.0 MB, 1 files)
|
||||
|
||||
- **escola/** (1 files, 10.0 MB, 54 cols)
|
||||
|
||||
## br_sp_saopaulo_geosampa_iptu/ (1 tables, 2.3 GB, 447 files)
|
||||
|
||||
- **iptu/** (447 files, 2.3 GB, 27 cols)
|
||||
|
||||
## br_stf_corte_aberta/ (2 tables, 87.3 MB, 33 files)
|
||||
|
||||
- **decisoes/** (32 files, 87.3 MB, 17 cols)
|
||||
- **dicionario/** (1 files, 2.3 KB, 5 cols)
|
||||
|
||||
## br_trase_supply_chain/ (6 tables, 59.9 MB, 24 files)
|
||||
|
||||
- **beef/** (3 files, 44.2 MB, 22 cols)
|
||||
- **beef_slaughterhouses/** (1 files, 718.9 KB, 19 cols)
|
||||
- **soy_beans/** (17 files, 14.8 MB, 25 cols)
|
||||
- **soy_beans_crushing_facilities/** (1 files, 19.6 KB, 13 cols)
|
||||
- **soy_beans_refining_facilities/** (1 files, 9.4 KB, 9 cols)
|
||||
- **soy_beans_storage_facilities/** (1 files, 225.1 KB, 12 cols)
|
||||
|
||||
## br_tse_eleicoes/ (22 tables, 8.2 GB, 4324 files)
|
||||
|
||||
- **bens_candidato/** (18 files, 134.1 MB, 10 cols)
|
||||
- **candidatos/** (29 files, 149.3 MB, 28 cols)
|
||||
- **despesas_candidato/** (255 files, 1.5 GB, 45 cols)
|
||||
- **detalhes_votacao_municipio/** (16 files, 17.4 MB, 25 cols)
|
||||
- **detalhes_votacao_municipio_zona/** (16 files, 19.7 MB, 26 cols)
|
||||
- **detalhes_votacao_secao/** (90 files, 401.2 MB, 24 cols)
|
||||
- **dicionario/** (1 files, 2.3 KB, 5 cols)
|
||||
- **partidos/** (21 files, 7.9 MB, 21 cols)
|
||||
- **perfil_eleitorado_local_votacao/** (15 files, 106.1 MB, 23 cols)
|
||||
- **perfil_eleitorado_municipio_zona/** (98 files, 128.2 MB, 13 cols)
|
||||
- **perfil_eleitorado_secao/** (1425 files, 1.9 GB, 15 cols)
|
||||
- **receitas_candidato/** (108 files, 806.4 MB, 55 cols)
|
||||
- **receitas_comite/** (7 files, 7.9 MB, 36 cols)
|
||||
- **receitas_orgao_partidario/** (7 files, 9.0 MB, 50 cols)
|
||||
- **resultados_candidato/** (37 files, 43.7 MB, 16 cols)
|
||||
- **resultados_candidato_municipio/** (99 files, 126.6 MB, 16 cols)
|
||||
- **resultados_candidato_municipio_zona/** (125 files, 187.0 MB, 17 cols)
|
||||
- **resultados_candidato_secao/** (1367 files, 2.0 GB, 17 cols)
|
||||
- **resultados_partido_municipio/** (25 files, 17.4 MB, 13 cols)
|
||||
- **resultados_partido_municipio_zona/** (26 files, 19.6 MB, 14 cols)
|
||||
- **resultados_partido_secao/** (523 files, 608.5 MB, 15 cols)
|
||||
- **vagas/** (16 files, 594.7 KB, 9 cols)
|
||||
|
||||
## br_tse_filiacao_partidaria/ (2 tables, 941.9 MB, 106 files)
|
||||
|
||||
- **microdados/** (44 files, 483.5 MB, 22 cols)
|
||||
- **microdados_antigos/** (62 files, 458.4 MB, 16 cols)
|
||||
|
||||
## dataset_new_arch/ (1 tables, 1.6 KB, 1 files)
|
||||
|
||||
- **tabela_new_arch/** (1 files, 1.6 KB, 5 cols)
|
||||
|
||||
## logs/ (2 tables, 5.5 GB, 4675 files)
|
||||
|
||||
- **cloudaudit_googleapis_com_activity/** (1826 files, 570.1 MB, 18 cols)
|
||||
- **cloudaudit_googleapis_com_data_access/** (2849 files, 4.9 GB, 18 cols)
|
||||
|
||||
## mundo_transfermarkt_competicoes/ (2 tables, 529.9 KB, 25 files)
|
||||
|
||||
- **brasileirao_serie_a/** (22 files, 458.6 KB, 35 cols)
|
||||
- **copa_brasil/** (3 files, 71.3 KB, 38 cols)
|
||||
|
||||
## mundo_transfermarkt_competicoes_internacionais/ (1 tables, 114.2 KB, 1 files)
|
||||
|
||||
- **champions_league/** (1 files, 114.2 KB, 55 cols)
|
||||
|
||||
## test_dataset/ (1 tables, 1.3 KB, 1 files)
|
||||
|
||||
- **test_table/** (1 files, 1.3 KB, 4 cols)
|
||||
|
||||
## us_harvard_ned/ (2 tables, 26.8 MB, 423 files)
|
||||
|
||||
- **parliamentary_elections/** (233 files, 13.4 MB, 238 cols)
|
||||
- **presidential_elections/** (190 files, 13.4 MB, 325 cols)
|
||||
|
||||
## world_ampas_oscar/ (1 tables, 16.7 KB, 1 files)
|
||||
|
||||
- **winner_demographics/** (1 files, 16.7 KB, 10 cols)
|
||||
|
||||
## world_iea_pirls/ (8 tables, 310.5 MB, 8 files)
|
||||
|
||||
- **dictionary/** (1 files, 27.0 KB, 5 cols)
|
||||
- **home_context/** (1 files, 9.7 MB, 120 cols)
|
||||
- **school_context/** (1 files, 680.6 KB, 103 cols)
|
||||
- **student_achievement/** (1 files, 106.5 MB, 864 cols)
|
||||
- **student_context/** (1 files, 104.2 MB, 157 cols)
|
||||
- **student_teacher_link/** (1 files, 81.1 MB, 51 cols)
|
||||
- **teacher_context/** (1 files, 1003.3 KB, 186 cols)
|
||||
- **within_country_scoring_reliability/** (1 files, 7.2 MB, 1057 cols)
|
||||
|
||||
## world_iea_timss/ (11 tables, 484.4 MB, 11 files)
|
||||
|
||||
- **dictionary/** (1 files, 15.8 KB, 5 cols)
|
||||
- **home_context_grade_4/** (1 files, 8.9 MB, 114 cols)
|
||||
- **school_context_grade_4/** (1 files, 672.5 KB, 111 cols)
|
||||
- **school_context_grade_8/** (1 files, 478.4 KB, 103 cols)
|
||||
- **student_achievement_grade_4/** (1 files, 236.3 MB, 110 cols)
|
||||
- **student_achievement_grade_8/** (1 files, 213.0 MB, 135 cols)
|
||||
- **student_context_grade_4/** (1 files, 13.0 MB, 127 cols)
|
||||
- **student_context_grade_8/** (1 files, 9.4 MB, 115 cols)
|
||||
- **teacher_context_grade_4/** (1 files, 680.6 KB, 87 cols)
|
||||
- **teacher_mathematics_grade_8/** (1 files, 627.1 KB, 168 cols)
|
||||
- **teacher_science_grade_8/** (1 files, 1.3 MB, 217 cols)
|
||||
|
||||
## world_imdb_movies/ (1 tables, 3.7 MB, 1 files)
|
||||
|
||||
- **top_movies_per_year/** (1 files, 3.7 MB, 23 cols)
|
||||
|
||||
## world_oecd_pisa/ (1 tables, 742.4 MB, 28 files)
|
||||
|
||||
- **student/** (28 files, 742.4 MB, 250 cols)
|
||||
|
||||
## world_oecd_public_finance/ (1 tables, 709.9 KB, 1 files)
|
||||
|
||||
- **country/** (1 files, 709.9 KB, 163 cols)
|
||||
|
||||
## world_olympedia_olympics/ (6 tables, 23.5 MB, 6 files)
|
||||
|
||||
- **athlete_bio/** (1 files, 16.5 MB, 11 cols)
|
||||
- **athlete_event_result/** (1 files, 3.9 MB, 11 cols)
|
||||
- **country/** (1 files, 3.2 KB, 2 cols)
|
||||
- **game/** (1 files, 5.7 KB, 10 cols)
|
||||
- **game_medal_tally/** (1 files, 14.6 KB, 9 cols)
|
||||
- **result/** (1 files, 3.0 MB, 11 cols)
|
||||
|
||||
## world_sofascore_competicoes_futebol/ (2 tables, 1.4 MB, 27 files)
|
||||
|
||||
- **brasileirao_serie_a/** (9 files, 603.4 KB, 85 cols)
|
||||
- **uefa_champions_league/** (18 files, 811.5 KB, 85 cols)
|
||||
|
||||
## world_wb_mides/ (9 tables, 37.4 GB, 4645 files)
|
||||
|
||||
- **dicionario/** (1 files, 16.5 KB, 5 cols)
|
||||
- **empenho/** (1488 files, 13.1 GB, 25 cols)
|
||||
- **licitacao/** (22 files, 160.9 MB, 32 cols)
|
||||
- **licitacao_item/** (210 files, 2.3 GB, 24 cols)
|
||||
- **licitacao_participante/** (22 files, 100.5 MB, 17 cols)
|
||||
- **liquidacao/** (1346 files, 7.2 GB, 20 cols)
|
||||
- **orgao_unidade_gestora/** (1 files, 2.1 MB, 8 cols)
|
||||
- **pagamento/** (1522 files, 14.3 GB, 25 cols)
|
||||
- **relacionamentos/** (33 files, 132.7 MB, 5 cols)
|
||||
|
||||
## world_wwf_hydrosheds/ (3 tables, 9.1 GB, 735 files)
|
||||
|
||||
- **basins_atlas/** (279 files, 5.6 GB, 296 cols)
|
||||
- **lakes_atlas/** (132 files, 1.5 GB, 307 cols)
|
||||
- **rivers_atlas/** (324 files, 2.0 GB, 297 cols)
|
||||
|
||||
---
|
||||
**Total: 533 tables · 675.4 GB · 77885 parquet files**
|
||||
299
docs/patterns-audit.md
Normal file
@@ -0,0 +1,299 @@
|
||||
# Pattern Audit — Robustness & False Positive Analysis
|
||||
|
||||
Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.
|
||||
|
||||
---
|
||||
|
||||
## US1 — Split Contracts Below Threshold (`split_contracts_below_threshold`)
|
||||
|
||||
### Legal basis
|
||||
**Fracionamento de licitação** is prohibited by:
|
||||
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
|
||||
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.
|
||||
|
||||
### Threshold: year-dependent
|
||||
|
||||
| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |
|
||||
|
||||
For 2023 data many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$57.912. Using a static R$17.600 for 2024+ data would miss the main fraud window (R$17k–R$57k per contract). **Fixed (iteration 7):** all three implementations compute the threshold from the query year.
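
As a minimal sketch of the year-dependent rule in the table, the threshold can be picked by a small helper. The function and constant names below are hypothetical and not taken from `index.ts`, `scan-all.ts`, or `scan-suspicious.ts`; only the two values and the 2023/2024 boundary come from the table.

```typescript
// Hypothetical helper: names are illustrative, values come from the table above.
const THRESHOLD_UNTIL_2023 = 17_600; // R$, query years <= 2023
const THRESHOLD_FROM_2024 = 57_912;  // R$, query years >= 2024

// Returns the split-contract threshold (in R$) that applies to a query year.
function splitThresholdForYear(year: number): number {
  return year <= 2023 ? THRESHOLD_UNTIL_2023 : THRESHOLD_FROM_2024;
}
```

Deriving the value from the query year, rather than hard-coding it, is what keeps 2024+ data from being checked against the older, lower limit.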
|
||||
|
||||
### False positive scenarios
|
||||
1. **Legitimate multi-item purchasing**: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The `combined_value > threshold` guard reduces but doesn't eliminate this.
|
||||
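2. **Recurring service contracts**: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year. Because contracts are grouped by signing month, each month bucket holds a single contract and never reaches the count ≥ 3 condition, so these are correctly NOT flagged even though the combined annual value (R$18.000) exceeds the ≤ 2023 threshold.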
|
||||
3. **Different sub-units**: The grouping uses `id_orgao_superior` (ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.
|
||||
|
||||
### Improvements applied
|
||||
- None structural. Filter `valor_inicial_compra > 0` prevents division issues.
|
||||
|
||||
### Known data quality issues
|
||||
- `data_assinatura_contrato` can be NULL for some older contracts. **`FORMAT_DATE` on NULL returns NULL — it does NOT exclude those rows.** Without a guard, all NULL-dated contracts from the same agency would be grouped together under a single `NULL` month bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now include `AND data_assinatura_contrato IS NOT NULL` in the WHERE clause.
|
||||
- `valor_inicial_compra` vs `valor_final_compra`: we use `valor_inicial_compra` intentionally since splitting is defined by the contract as signed, not final.
|
||||
|
||||
### Improvements applied (iteration 5)
|
||||
- Added `AND data_assinatura_contrato IS NOT NULL` to WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spurious `mes = NULL` bucket.
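
The effect of the NULL guard can be illustrated with an in-memory analogue of the grouping. This is a sketch under stated assumptions, not the SQL used by the scanners: the `ContractRow` shape, the `findSplitBuckets` name, the `agency|YYYY-MM` key format, and the strict "below threshold" comparison are all hypothetical; the column names, the month bucket, the count ≥ 3 condition, and the combined-value check are taken from the notes above.

```typescript
// Hypothetical row shape; field names mirror the columns discussed above.
interface ContractRow {
  id_orgao_superior: string;
  data_assinatura_contrato: string | null; // "YYYY-MM-DD" or NULL
  valor_inicial_compra: number;
}

// Returns "agency|YYYY-MM" keys whose bucket looks like a split purchase.
function findSplitBuckets(rows: ContractRow[], threshold: number): string[] {
  const buckets: Record<string, { count: number; combined: number }> = {};
  for (const row of rows) {
    // Guard: NULL dates are skipped so they never share one spurious bucket.
    if (row.data_assinatura_contrato === null) continue;
    // Only positive-value contracts individually below the threshold count.
    if (row.valor_inicial_compra <= 0 || row.valor_inicial_compra >= threshold) continue;
    const month = row.data_assinatura_contrato.slice(0, 7);
    const key = `${row.id_orgao_superior}|${month}`;
    const bucket = buckets[key] ?? { count: 0, combined: 0 };
    bucket.count += 1;
    bucket.combined += row.valor_inicial_compra;
    buckets[key] = bucket;
  }
  // Flag buckets with at least 3 contracts whose combined value exceeds the threshold.
  return Object.keys(buckets).filter((key) => {
    const b = buckets[key];
    return b !== undefined && b.count >= 3 && b.combined > threshold;
  });
}
```

Without the `null` check, three undated contracts from the same agency would land in a single bucket and could trip the flag even if they were signed years apart.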
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
✅ Fixed (iteration 8): `scan-all.ts` now includes `id_orgao_superior` in both SELECT and GROUP BY, matching `index.ts` and `scan-suspicious.ts`. Prevents theoretical merging of two distinct ministries sharing the same name.
|
||||
|
||||
---
|
||||
|
||||
## US2 — Contract Concentration (`contract_concentration`)
|
||||
|
||||
### Legal basis
|
||||
No specific legal prohibition, but **TCU** and **CGU** audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
|
||||
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.
|
||||
|
||||
### Thresholds
|
||||
- **40% share**: empirical; above this, competition is functionally absent for that agency.
|
||||
- **R$ 50.000 minimum agency total**: excludes micro-units (small local offices) where one purchase naturally dominates.
|
||||
- **R$ 10.000 minimum supplier spend** (new, iteration 2): excludes trivial cases like a company with R$21k of a R$50k agency = 42% but both numbers are small.
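
A minimal sketch of how the three thresholds combine for one supplier/agency pair follows. `CONCENTRATION_MIN_SUPPLIER_SPEND` is the constant named in this audit; the other two constant names and the `isConcentrated` function are hypothetical, and the real check lives in a SQL HAVING clause rather than in TypeScript.

```typescript
const CONCENTRATION_MIN_SHARE = 0.4;             // hypothetical name; 40% share
const CONCENTRATION_MIN_AGENCY_TOTAL = 50_000;   // hypothetical name; R$
const CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000; // R$, named in this audit

// True when a supplier's share of an agency's total spend should be flagged.
function isConcentrated(supplierSpend: number, agencyTotal: number): boolean {
  if (agencyTotal < CONCENTRATION_MIN_AGENCY_TOTAL) return false;     // micro-unit
  if (supplierSpend < CONCENTRATION_MIN_SUPPLIER_SPEND) return false; // trivial spend
  return supplierSpend / agencyTotal > CONCENTRATION_MIN_SHARE;
}
```

For example, `isConcentrated(30_000, 60_000)` is flagged (50% share), while `isConcentrated(30_000, 40_000)` is not, because the agency total is below the minimum.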
|
||||
|
||||
### False positive scenarios
|
||||
1. **Specialized niches**: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
|
||||
2. **Monopolistic markets**: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
|
||||
3. **Framework agreements**: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.
|
||||
|
||||
### Improvements applied
|
||||
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000` to batch query and `scan-suspicious.ts` (iteration 2).
|
||||
- Added `CONCENTRATION_MIN_SUPPLIER_SPEND` filter to `index.ts` `patternConcentration` HAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
✅ Fixed (iteration 4): `index.ts` HAVING clause now includes `supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND`.
|
||||
✅ Fixed (iteration 9): `scan-all.ts` and `scan-suspicious.ts` now group by `(id_orgao_superior, nome_orgao_superior)` in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.
|
||||
|
||||
---
|
||||
|
||||
## US3 — Inexigibility Recurrence (`inexigibility_recurrence`)
|
||||
|
||||
### Legal basis
|
||||
**Inexigibilidade de licitação** (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
|
||||
- Reference: **TCU Acórdão 1.793/2011**: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.
|
||||
|
||||
### Threshold: 3 contracts per managing unit
|
||||
- Below 3: could be two legitimate sole-source needs in the same year.
|
||||
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.
|
||||
|
||||
### False positive scenarios
|
||||
1. **Legitimate exclusive suppliers**: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
|
||||
2. **Long-term technical partnerships**: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
|
||||
3. **Artistic/cultural organizations**: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.
|
||||
|
||||
### Improvements applied (iteration 2)
|
||||
- **Batch + scan-suspicious**: Now groups by `id_unidade_gestora` (ID) + `nome_unidade_gestora` (name). Previously grouped by name only, risking merger of distinct units sharing a common name.
|
||||
- **Batch + scan-suspicious**: Added `valor_inicial_compra >= R$ 1.000` filter. Micro-value contracts (< R$1k) rarely represent real abuse.
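
The two changes above (composite grouping key and minimum value) can be sketched in memory as follows. The `InexContract` shape, the constant names, and `recurrentUnits` are hypothetical; the column names, the R$ 1.000 floor, and the 3-contract threshold come from this section.

```typescript
// Hypothetical row shape; field names mirror the columns named above.
interface InexContract {
  id_unidade_gestora: string;
  nome_unidade_gestora: string;
  valor_inicial_compra: number;
}

const INEX_MIN_VALUE = 1_000;  // R$, micro-value contracts are ignored
const INEX_MIN_CONTRACTS = 3;  // recurrence threshold per managing unit

// Returns "id|name" keys of managing units with recurrent inexigibilidade.
function recurrentUnits(contracts: InexContract[]): string[] {
  const counts: Record<string, number> = {};
  for (const c of contracts) {
    if (c.valor_inicial_compra < INEX_MIN_VALUE) continue;
    // Composite key: ID plus name, so units that share a name stay separate.
    const key = `${c.id_unidade_gestora}|${c.nome_unidade_gestora}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return Object.keys(counts).filter((key) => (counts[key] ?? 0) >= INEX_MIN_CONTRACTS);
}
```

Grouping by name alone would merge two distinct units called, say, "Reitoria" and could push their combined count over the threshold.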
|
||||
|
||||
### Improvements applied (iteration 4)
|
||||
- **`index.ts`**: Added `AND valor_inicial_compra >= @min_value` to WHERE clause of `patternInexigibility`. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
✅ Fixed (iteration 4): all three implementations now filter `valor_inicial_compra >= R$ 1.000` and group by `id_unidade_gestora`.
|
||||
|
||||
---
|
||||
|
||||
## US4 — Single Bidder (`single_bidder`)
|
||||
|
||||
### Legal basis
|
||||
Not inherently illegal, but flagged by:
|
||||
- **Open Contracting Partnership "73 Red Flags" (2024)**, Flag #1: "Only one bid received."
|
||||
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.
|
||||
|
||||
### Threshold: 2 occurrences
|
||||
- Intentionally low. Even a single solo-bid win is worth reviewing in context; two is the minimum that constitutes a pattern.
|
||||
|
||||
### False positive scenarios
|
||||
1. **Specialized markets**: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
|
||||
2. **Geographic isolation**: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
|
||||
3. **Poorly timed notices**: Short bid windows or holiday periods reduce participation regardless of market structure.
|
||||
|
||||
### SQL robustness notes
|
||||
- Per-CNPJ: uses `STARTS_WITH(REGEXP_REPLACE(...), @cnpj)` — this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface.
|
||||
- Batch: uses `MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))` to extract the winner's CNPJ from the `auction_stats` CTE. The `LENGTH = 14` guard in the `IF` condition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows have `vencedor=true` for the same auction (data quality issue), `MAX` picks lexicographically last — acceptable for batch purposes.
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
✅ Fixed (iteration 8): **batch now counts ALL participants** (CPF + CNPJ) for `total_bidders`, matching per-CNPJ behavior. Previously, `LENGTH = 14` excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The `LENGTH = 14` guard is now applied only inside the `winner_cnpj` extraction `IF()` condition — not to the overall participant count.
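
The distinction between "who counts as a bidder" and "whose CNPJ is extracted as winner" can be sketched as below. This is an illustrative in-memory analogue of the `auction_stats` CTE, not its actual SQL: the `Participant` shape and the `id_licitacao` and `cpf_cnpj` field names are assumptions, as is the use of the first 8 digits as the key.

```typescript
// Hypothetical participant row; only `vencedor` is a name taken from the notes above.
interface Participant {
  id_licitacao: string;
  cpf_cnpj: string; // raw value, possibly punctuated
  vencedor: boolean;
}

interface AuctionStats {
  totalBidders: number;
  winnerCnpjBasico: string | null;
}

function auctionStats(participants: Participant[]): Record<string, AuctionStats> {
  const stats: Record<string, AuctionStats> = {};
  for (const p of participants) {
    const digits = p.cpf_cnpj.replace(/\D/g, "");
    const entry = stats[p.id_licitacao] ?? { totalBidders: 0, winnerCnpjBasico: null };
    // Every participant counts toward the total, CPF and CNPJ alike.
    entry.totalBidders += 1;
    // The 14-digit guard applies only when extracting the winner's CNPJ.
    if (p.vencedor && digits.length === 14) {
      const basico = digits.slice(0, 8);
      // Mirrors MAX(): with two winner rows, keep the lexicographically last.
      if (entry.winnerCnpjBasico === null || basico > entry.winnerCnpjBasico) {
        entry.winnerCnpjBasico = basico;
      }
    }
    stats[p.id_licitacao] = entry;
  }
  return stats;
}
```

An auction with one CNPJ winner and one CPF participant ends up with `totalBidders === 2`, so it is no longer treated as single-bidder.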
|
||||
|
||||
---
|
||||
|
||||
## US5 — Always Winner (`always_winner`)
|
||||
|
||||
### Legal basis
|
||||
Not illegal per se, but high win rates in competitive auctions indicate possible:
|
||||
- Bid rigging (Lei 12.529/2011 art. 36, IV)
|
||||
- Tailored specifications (Lei 14.133/2021 art. 9, I)
|
||||
- Reference: **OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)**
|
||||
|
||||
### Thresholds
|
||||
- **≥80% win rate** (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
|
||||
- **≥10 competitive participations** — minimum sample for statistical significance. Aligns batch and per-CNPJ.
|
||||
- **Competitive auctions only (≥2 bidders)** — critical to avoid overlap with US4.
|
||||
|
||||
### Critical fix applied (iteration 2)
|
||||
**The per-CNPJ version was NOT filtering for competitive auctions before this iteration.** A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a `competitive_auctions` CTE that filters `COUNT(1) >= 2`.
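
A minimal sketch of the competitive-auction filter, written as plain TypeScript rather than the `competitive_auctions` CTE itself: the `Bid` shape and function names are hypothetical, while the ≥ 2 bidders rule, the 10-participation minimum, and the 0.80 per-CNPJ threshold come from this section.

```typescript
// Hypothetical bid row; one row per participant per auction.
interface Bid {
  id_licitacao: string;
  cnpj_basico: string;
  vencedor: boolean;
}

// Win rate per company, counting only auctions with at least 2 bidders.
function winRates(bids: Bid[], minParticipations = 10): Record<string, number> {
  const bidderCount: Record<string, number> = {};
  for (const b of bids) {
    bidderCount[b.id_licitacao] = (bidderCount[b.id_licitacao] ?? 0) + 1;
  }
  const tally: Record<string, { played: number; won: number }> = {};
  for (const b of bids) {
    // Skip single-bidder auctions: those belong to US4, not US5.
    if ((bidderCount[b.id_licitacao] ?? 0) < 2) continue;
    const t = tally[b.cnpj_basico] ?? { played: 0, won: 0 };
    t.played += 1;
    if (b.vencedor) t.won += 1;
    tally[b.cnpj_basico] = t;
  }
  const rates: Record<string, number> = {};
  for (const cnpj of Object.keys(tally)) {
    const t = tally[cnpj];
    if (t !== undefined && t.played >= minParticipations) rates[cnpj] = t.won / t.played;
  }
  return rates;
}

function isAlwaysWinner(rate: number, threshold = 0.8): boolean {
  return rate >= threshold;
}
```

A company that only ever wins uncontested auctions gets no entry in the result at all, which is what removes the US4/US5 double-counting.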
|
||||
|
||||
### Win rate distribution note
|
||||
The `licitacao_participante` dataset is **strongly bimodal**: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only **perfect-win companies** — intentionally strict. This is documented in the spec.
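
Why Q3 lands on 1.0 can be seen with a small worked sketch: once more than 25% of the sample sits at the maximum value, the 75th percentile is that maximum. The nearest-rank definition below is an assumption chosen for simplicity; the batch query may compute Q3 differently, and the sample data is synthetic.

```typescript
// Nearest-rank quantile: the smallest value with at least q of the sample at or below it.
function quantileNearestRank(values: number[], q: number): number {
  if (values.length === 0) throw new Error("empty sample");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Synthetic sample: 67 companies spread below 1.0, 33 with a perfect win rate.
const syntheticRates: number[] = [];
for (let i = 0; i < 67; i++) syntheticRates.push(i / 100);
for (let i = 0; i < 33; i++) syntheticRates.push(1.0);

const q3 = quantileNearestRank(syntheticRates, 0.75);
```

With this shape of distribution, a `rate >= q3` rule flags only the perfect-win group, which matches the "intentionally strict" behaviour described above.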
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.
|
||||
|
||||
---
|
||||
|
||||
## US6 — Amendment Inflation (`amendment_inflation`)
|
||||
|
||||
### Legal basis
|
||||
**Lei 14.133/2021 art. 125 §1º**: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract **reached or exceeded its legal ceiling**.
|
||||
|
||||
### Threshold: 1.25× (25% above original)
|
||||
- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).
|
||||
|
||||
### False positive scenarios
|
||||
1. **Lawful exceptional amendments**: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
|
||||
2. **Construction contracts**: Legal ceiling is 50% (not 25%). A flat 1.25× threshold flags construction contracts that are within the legal limit; the keyword detection added in iteration 8 mitigates this.
|
||||
3. **Value adjustment clauses**: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
|
||||
4. **Data entry errors**: Some `valor_final_compra` values are clearly data quality issues (e.g., 100× original).
|
||||
|
||||
### Improvements applied (iteration 3)
|
||||
- **Cap `inflation_ratio` at 10×** (`AMENDMENT_MAX_INFLATION_RATIO = 10.0`): ratios above this cap are almost certainly data entry errors (e.g., `valor_final_compra` entered in a different unit) and would distort `total_excess` reporting. Applied via an `AND ... <= @max_ratio` SQL filter in all three implementations (`index.ts`, `scan-all.ts`, `scan-suspicious.ts`).
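
As a rough sketch of how the lower threshold and the upper cap bound the flagged range, assuming the ratio is simply `valor_final_compra / valor_inicial_compra`: `AMENDMENT_MAX_INFLATION_RATIO` is the constant named above, while `isInflated` and its parameter names are hypothetical.

```typescript
const AMENDMENT_MAX_INFLATION_RATIO = 10.0; // cap named in this audit

// True when final/initial falls in [ceiling, AMENDMENT_MAX_INFLATION_RATIO].
function isInflated(valorInicial: number, valorFinal: number, ceiling = 1.25): boolean {
  if (valorInicial <= 0) return false; // avoids division by zero
  const ratio = valorFinal / valorInicial;
  return ratio >= ceiling && ratio <= AMENDMENT_MAX_INFLATION_RATIO;
}
```

A contract that goes from R$100.000 to R$130.000 is flagged; one recorded as going from R$100.000 to R$10.000.000 (100×) is treated as a likely data error and left out.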
|
||||
|
||||
### Schema verification: construction vs goods/services threshold
|
||||
Lei 14.133/2021 art. 125 §1º allows 50% amendments for engineering works vs 25% for goods/services.
|
||||
|
||||
**Column verified (schema dump):** `contrato_compra` has `id_modalidade_licitacao` (code) and `modalidade_licitacao` (name). However, this column encodes **bidding modality** (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no `tipo_contrato` or `categoria` column in the accessible schema.
|
||||
|
||||
### Improvements applied (iteration 8): construction keyword detection
|
||||
All three implementations now apply `IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25)` to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.
|
||||
|
||||
**Keywords and rationale:**
|
||||
| Keyword | Matches | Rationale |
|---------|---------|-----------|
| `obra` | obra, obras | General construction work |
| `constru` | construção, construir | Building/construction |
| `reform` | reforma, reformar, reformas | Renovation/remodeling |
| `engenhari` | engenharia, engenheiro | Engineering services |
| `paviment` | pavimentação, pavimento | Road/floor paving |
| `demoli` | demolição, demolir | Demolition |
|
||||
|
||||
**Known limitations:** The `objeto` field is free-text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (conservative false positive vs missed construction exemption).
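
The keyword rule can be mirrored outside SQL with the same pattern. This is a sketch for illustration only: the function name `amendmentCeiling` is hypothetical, and JavaScript's `RegExp` stands in for BigQuery's `REGEXP_CONTAINS`; the pattern and the 1.50/1.25 values are the ones quoted above.

```typescript
// Same alternation as the SQL pattern; no "g" flag, so test() is stateless.
const CONSTRUCTION_KEYWORDS = /obra|constru|reform|engenhari|paviment|demoli/;

// Returns the applicable ratio ceiling for a contract's free-text object.
function amendmentCeiling(objeto: string | null): number {
  const text = (objeto ?? "").toLowerCase(); // mirrors LOWER(IFNULL(objeto, ''))
  return CONSTRUCTION_KEYWORDS.test(text) ? 1.5 : 1.25;
}
```

Because this is a plain substring match, a description containing a word such as "manobra" would also match `obra` and be evaluated at 1.50×.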
|
||||
|
||||
### Improvements applied (iteration 9): constructionCount field
|
||||
`AmendmentInflationFlag` now includes `constructionCount`: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."
|
||||
|
||||
### Per-CNPJ vs batch consistency
|
||||
⚠️ Minor divergence (accepted): `index.ts` includes the aditivos CTE (`zeroAmendmentCount`) and `constructionCount` from `is_construction`. The batch scanners do NOT include these — `contrato_termo_aditivo` full scan is too expensive in batch, and `constructionCount` is per-row info not aggregable without the row-level data. Both fields are only available in the web UI's per-CNPJ output.
|
||||
|
||||
---
|
||||
|
||||
## US7 — Newborn Company (`newborn_company`)
|
||||
|
||||
### Legal basis
|
||||
No specific prohibition, but:
|
||||
- **Lei 14.133/2021 art. 68, I**: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
|
||||
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of possible shell company.
|
||||
|
||||
### Thresholds
|
||||
- **180 days** (6 months): practical minimum for legitimate operational readiness.
|
||||
- **R$ 50.000 minimum contract value**: excludes training contracts and small acquisitions where new companies are common and low-risk.
|
||||
|
||||
### False positive scenarios
|
||||
1. **Spinoffs and restructurings**: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
|
||||
2. **Holding company structures**: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
|
||||
3. **Startups in innovation programs**: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
|
||||
4. **`data_inicio_atividade` from establishments**: The founding date comes from `br_me_cnpj.estabelecimentos`, not `empresas`. Branches opened after the headquarters can make an established company appear "newborn" in a specific municipality.
|
||||
|
||||
### Data quality note
|
||||
`data_inicio_atividade` lives in `br_me_cnpj.estabelecimentos`, NOT `empresas`. The query uses `MIN(est.data_inicio_atividade)` across all establishments for the same `cnpj_basico`. This picks the earliest known opening date, which reduces false positives caused by branches opened later.
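
A minimal sketch of the age check, taking the earliest establishment date as the founding date: the function and constant names are hypothetical, dates are assumed to be ISO `YYYY-MM-DD` strings, and the inclusive `<= 180` comparison and the rejection of negative ages are assumptions of this sketch rather than confirmed behaviour of the queries.

```typescript
const NEWBORN_MAX_AGE_DAYS = 180;  // 6 months
const NEWBORN_MIN_VALUE = 50_000;  // R$
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// ISO "YYYY-MM-DD" strings sort chronologically as plain strings.
function earliestDate(isoDates: string[]): string | null {
  return isoDates.length === 0 ? null : isoDates.slice().sort()[0];
}

function isNewborn(
  establishmentDates: string[], // data_inicio_atividade of every establishment
  firstContractDate: string,
  contractValue: number
): boolean {
  const founding = earliestDate(establishmentDates); // analogue of MIN(...)
  if (founding === null || contractValue < NEWBORN_MIN_VALUE) return false;
  const ageDays = (Date.parse(firstContractDate) - Date.parse(founding)) / MS_PER_DAY;
  // Assumption: a contract dated before founding is a data error, not a flag.
  return ageDays >= 0 && ageDays <= NEWBORN_MAX_AGE_DAYS;
}
```

A company with a 2015 headquarters and a branch opened in May 2023 is judged by the 2015 date, so a June 2023 contract does not flag it.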
|
||||
|
||||
### Per-CNPJ vs batch consistency
✅ Equivalent. Both use `MIN(data_inicio_atividade)` across establishments with `ano=2023 AND mes=12`.

⚠️ **Known necessary full-table scan**: The `first_contract` CTE in `batchNewborn` (`scan-all.ts`) intentionally omits an `ano` filter on `contrato_compra`:
```sql
-- Excerpt of the first_contract CTE; the SELECT list is not shown.
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
```
This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: *"did this company win its very first contract within 180 days of founding?"* Restricting to `ano = ANO` would miss the true first contract if it occurred in an earlier year, producing a false negative. The `founding` CTE correctly filters `e.ano = ANO AND est.ano = ANO AND est.mes = 12`. Only `first_contract` scans all years, but the `LENGTH = 14` CPF exclusion and the `valor_final_compra >= R$ 50k` filter significantly reduce bytes scanned.

---

## US8 — Sudden Surge (`sudden_surge`)

### Legal basis
Not illegal, but flagged by:
- **UNODC "Guidebook on anti-corruption in public procurement" (2013)**: "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.

### Thresholds
- **5× YoY growth**: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- **R$ 1.000.000 minimum**: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- **4-year lookback**: captures context before the surge.

### False positive scenarios
1. **Post-restructuring recovery**: A company that was inactive for 2 years then resumed full operations would appear to surge.
2. **New framework agreements**: Being added to a large framework agreement in year N can produce an apparent surge with no underlying change in the company.
3. **Government budget cycles**: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements), creating apparent surges.

### SQL robustness note
Both per-CNPJ and batch use a `prev_v > 0` guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses the `LAG` window function; per-CNPJ iterates over the history array client-side.

**Consecutive-year guard (iteration 6):** The spec says `value[year_N] / value[year_N-1]`. Without a guard, `LAG` compares any adjacent rows in sorted order: if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:
- `scan-all.ts`: added `LAG(ano)` alongside `LAG(v)` and `WHERE ano - prev_ano = 1`
- `index.ts`, `scan-suspicious.ts`: added `curr.ano - prev.ano === 1` to the JS loop condition

**Effect on false positives:** The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger, since the consecutive-year guard skips comparisons for companies that were dormant for ≥1 year.

The per-CNPJ implementation reports only the **first** qualifying surge year (breaks on first hit). If a company surged twice, only the earlier event is shown. This is conservative.

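The guards described above can be combined into one client-side loop. The following is a minimal sketch of that idea, not the code in `index.ts`: the `YearValue` and `Surge` types and the `firstSurge` name are hypothetical, and applying the R$ 1.000.000 minimum to the current-year value with an inclusive `>= 5` ratio is an assumption.

```typescript
// Hypothetical shapes for one company's yearly contract totals.
interface YearValue {
  ano: number;
  v: number; // total contract value in R$ for that year
}

interface Surge {
  ano: number;
  ratio: number;
}

const MIN_RATIO = 5; // 5x YoY growth threshold from the text
const MIN_VALUE = 1000000; // assumption: applied to the current-year value

// Returns the first qualifying surge year, or null when none qualifies.
function firstSurge(history: YearValue[]): Surge | null {
  const sorted = [...history].sort((a, b) => a.ano - b.ano);
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    if (
      prev.v > 0 && // zero -> nonzero transitions are left to US7
      curr.ano - prev.ano === 1 && // consecutive-year guard
      curr.v >= MIN_VALUE &&
      curr.v / prev.v >= MIN_RATIO
    ) {
      return { ano: curr.ano, ratio: curr.v / prev.v }; // break on first hit
    }
  }
  return null;
}
```

With this guard, a history containing only 2019 and 2023 yields no surge, because the two rows are adjacent in sort order but not consecutive years.
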
### Per-CNPJ vs batch consistency
✅ Equivalent. Batch uses SQL `LAG`; per-CNPJ uses JS loop. Both find the first qualifying year.

---

## Infrastructure Issue: Cache Miss vs Stored Null

### Bug 1: Cache Miss vs Stored Null (fixed iteration 6)

`cache.ts` `getCache` was returning `null` for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use `null` as their "nothing found" sentinel and check `cached !== undefined` to skip re-querying. With the old `getCache` returning `null` on a miss, `null !== undefined` evaluated to `true`, so the BigQuery query was skipped permanently: US4–US8 would never execute on a CNPJ not yet in cache.

**Fix:** `getCache` now returns `undefined` on miss or expiry; returns `T` (including `null`) on a valid cache hit. The company-lookup caller that used `!== null` was updated to `!== undefined`.

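The contract described in the fix can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions: the real `cache.ts` is file-based, while the `Map` store, the `setCache` signature, the TTL argument, and the `lookupPattern` caller below are all hypothetical.

```typescript
// In-memory stand-in for the file-based cache (assumption for illustration).
interface Entry {
  value: unknown;
  expiresAt: number; // epoch milliseconds
}

const store = new Map<string, Entry>();

function setCache<T>(key: string, value: T, ttlMs: number): void {
  store.set(key, { value, expiresAt: Date.now() + ttlMs });
}

// undefined means "miss or expired"; any stored value, including null, is a hit.
function getCache<T>(key: string): T | undefined {
  const entry = store.get(key);
  if (entry === undefined || entry.expiresAt <= Date.now()) return undefined;
  return entry.value as T;
}

let queries = 0; // counts how often the (stubbed) query runs

// Hypothetical caller that caches null as its "nothing found" sentinel.
function lookupPattern(cnpjBasico: string): string | null {
  const cached = getCache<string | null>(cnpjBasico);
  if (cached !== undefined) return cached; // a stored null is a valid hit
  queries++;
  const result: string | null = null; // stub: the query found nothing
  setCache(cnpjBasico, result, 60000);
  return result;
}
```

The first call for a CNPJ runs the stubbed query and stores `null`; the second call returns the stored `null` without querying again, which is the behavior the old `null`-on-miss version could not provide.
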
### Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)

US1, US2, US3, and `runPatterns()` in `index.ts` used `if (cached) return cached` to check for cache hits. This was recorded as silently discarding a cached "no flags found" result and re-querying BigQuery on every subsequent call for clean CNPJs. Note, however, that an empty array `[]` is **truthy** in JavaScript, so a cached `[]` does pass `if (cached)`; a truthiness check only discards falsy cached values such as `null`, `0`, or `''`. The check was still inconsistent with the `getCache` contract, in which `undefined` is the only miss signal.

Affected: `patternSplitContracts`, `patternConcentration`, `patternInexigibility`, `runPatterns`.

**Fix:** changed all four to `if (cached !== undefined) return cached`, which matches the `getCache` contract and stays correct if one of these functions ever caches a falsy value. (US4–US8 already used this pattern since they cache `null` as "nothing found"; they were correct.)

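The difference between the two checks is easiest to see side by side. The helper names below are hypothetical and exist only for this comparison; the language behavior they exercise is standard JavaScript, where `Boolean([])` is `true` and `Boolean(null)` is `false`.

```typescript
// Hypothetical helpers that isolate the two cache-hit conditions.
function hitByTruthiness(cached: unknown): boolean {
  return Boolean(cached); // same test as `if (cached)`
}

function hitByUndefinedCheck(cached: unknown): boolean {
  return cached !== undefined; // same test as `if (cached !== undefined)`
}

// Values a pattern might have stored, plus undefined for a miss.
const samples: Array<[string, unknown]> = [
  ["miss (undefined)", undefined],
  ["stored null", null],
  ["stored empty array", []],
  ["stored zero", 0],
];

for (const [label, value] of samples) {
  console.log(label, {
    truthiness: hitByTruthiness(value),
    undefinedCheck: hitByUndefinedCheck(value),
  });
}
```

Only the explicit `!== undefined` comparison treats every stored value as a hit while still reporting a miss for `undefined`.
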
---

## Cross-Pattern Issues

### Overlap between US4 and US5
- **Before iteration 2**: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- **After iteration 2**: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.

### Overlap between US7 and US8
- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive; both signals reinforce each other.

### CNPJ matching strategy
All patterns use `cnpj_basico` (8-digit root) as the joining key. This means **all establishments (headquarters and branches)** registered under the same root are attributed to the same `cnpj_basico`; subsidiaries incorporated with their own CNPJ root are not merged. This can create false positives for large corporations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.

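Deriving the join key can be sketched as follows. The function name `cnpjBasico` is hypothetical; the digit stripping and the 14-digit length test mirror the `LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14` filter quoted in this document, and taking the first 8 digits follows the "8-digit root" definition above.

```typescript
// Returns the 8-digit CNPJ root, or null when the identifier is not a 14-digit CNPJ
// (for example an 11-digit CPF).
function cnpjBasico(raw: string): string | null {
  const digits = raw.replace(/\D/g, ""); // drop dots, slashes, dashes
  return digits.length === 14 ? digits.slice(0, 8) : null;
}

// Two establishments of the same company share a root and land in the same group.
const contractors = ["12.345.678/0001-95", "12.345.678/0002-76", "123.456.789-09"];
const roots = new Set<string>();
for (const c of contractors) {
  const root = cnpjBasico(c);
  if (root !== null) roots.add(root);
}
console.log(roots.size);
```

The sample identifiers are made up for the example and are not checked for valid verification digits.
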
---

## Summary Table

| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---------|---------|------------|---------------|
| US1 Split | Medium: multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; **batch GROUP BY now includes id_orgao_superior** |
| US2 Concentration | Medium: specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; **falsy cache check fixed**; **all 3 now GROUP BY (id+name), no ministry-name collision** |
| US3 Inexigibility | High: legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; **falsy cache check fixed** |
| US4 Single Bidder | Medium: specialized/remote markets | OCP 2024 Flag #1 | **cache.ts bug fixed** (getCache null-vs-undefined); **batch now counts all participants (CPF+CNPJ)**, consistent with per-CNPJ |
| US5 Always Winner | **Was HIGH** (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; **cache.ts bug fixed** |
| US6 Amendment | Medium: inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; **cache.ts bug fixed**; **construction keyword detection: 1.50× threshold for obras/etc.**; **constructionCount in UI flag** |
| US7 Newborn | High: spinoffs, restructurings | CGU 2021 guide | **cache.ts bug fixed** (was never querying BigQuery on cache miss) |
| US8 Surge | Medium: framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; **cache.ts bug fixed** |
147562
docs/schemas.json
Normal file
File diff suppressed because one or more lines are too long
BIN
docs/wordcloud_attributes.png
Normal file
Binary file not shown.
Size: 1.8 MiB
45
docs/wordcloud_attributes.py
Normal file
@@ -0,0 +1,45 @@
#!/usr/bin/env python3
import json
import re
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

STOPWORDS = {'de', 'do', 'da', 'a', 'ou', 'em', 'e', 'o', 'que', 'das', 'dos', 'nos', 'nas', 'um', 'uma', 'para', 'com', 'não', 'à', 'ao', 'os', 'as', 'se', 'na', 'no', 'é', 'ser', 'seu', 'sua', 'isso', 'the', 'of', 'and', 'in', 'to', 'is', 'for', 'on', 'with', 'at', 'by', 'from'}

with open('context/basedosdados-schema.json', encoding='utf-8') as f:
    schema = json.load(f)

words = []
for dataset, tables in schema.items():
    for table, cols in tables.items():
        for col in cols:
            name = (col.get('name') or '').lower()  # `or ''` also covers explicit nulls
            desc = (col.get('description') or '').lower()
            if name and len(name) >= 3:
                words.append(name)
            if desc:
                for w in desc.split():
                    w = re.sub(r'[^a-záàâãéèêíìîóòôõúùûç]', '', w)  # keep letters only
                    if len(w) >= 3 and w not in STOPWORDS:
                        words.append(w)

word_freq = Counter(words)

wc = WordCloud(
    width=1600,
    height=800,
    background_color='white',
    max_words=200,
    colormap='viridis',
    min_font_size=8
).generate_from_frequencies(word_freq)

plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_attributes.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_attributes.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))
BIN
docs/wordcloud_datasets.png
Normal file
Binary file not shown.
Size: 1.3 MiB
33
docs/wordcloud_datasets.py
Normal file
@@ -0,0 +1,33 @@
#!/usr/bin/env python3
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

with open('context/basedosdados-schema.json', encoding='utf-8') as f:
    schema = json.load(f)

dataset_names = []
for dataset in schema.keys():
    parts = [p for p in dataset.split('_') if p != 'mundo']  # 'br'/'eu' are dropped by the length filter below
    dataset_names.extend([p for p in parts if len(p) >= 3])

word_freq = Counter(dataset_names)

wc = WordCloud(
    width=1600,
    height=800,
    background_color='white',
    max_words=100,
    colormap='plasma',
    min_font_size=10
).generate_from_frequencies(word_freq)

plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig('docs/wordcloud_datasets.png', dpi=150, bbox_inches='tight')
print("Saved docs/wordcloud_datasets.png")
print(f"Total unique words: {len(word_freq)}")
print("Top 30:", word_freq.most_common(30))