24 KiB
Pattern Audit — Robustness & False Positive Analysis
Deep audit of all 8 risk patterns. For each pattern: legal basis, threshold rationale, known false positive scenarios, data quality notes, and differences between the per-CNPJ (interactive) and batch (scan-all) implementations.
US1 — Split Contracts Below Threshold (split_contracts_below_threshold)
Legal basis
Fracionamento de licitação is prohibited by:
- Lei 8.666/1993, art. 23, §5º: "É vedada a utilização da modalidade 'convite' ou 'tomada de preços' [...] para parcelas de uma mesma obra ou serviço."
- Lei 14.133/2021, art. 145: directly prohibits splitting to evade the mandatory bidding requirement.
Threshold: year-dependent
| Period | Threshold | Legal basis |
|---|---|---|
| ≤ 2023 | R$ 17.600 | Decreto 9.412/2018 / Lei 8.666/93 art. 23, I, "a" |
| 2024+ | R$ 57.912 | Decreto 11.871/2024 / Lei 14.133/2021 art. 75, I |
For 2023 data many contracts still ran under Lei 8.666/93 (both laws co-existed). From 2024 the threshold is R$57.912. Using a static R$17.600 for 2024+ data would miss the main fraud window (R$17k–R$57k per contract). Fixed (iteration 7): all three implementations compute the threshold from the query year.
False positive scenarios
- Legitimate multi-item purchasing: A supplier providing diverse small items (office supplies, food for canteen) legitimately generates many small contracts below threshold from the same agency. The
combined_value > thresholdguard reduces but doesn't eliminate this. - Recurring service contracts: Monthly service fees (e.g., R$1.500/month cleaning) generate 12 contracts/year — correctly NOT flagged (combined = R$18.000 > threshold, count ≥ 3 in first 3 months).
- Different sub-units: The grouping uses
id_orgao_superior(ministry level). A ministry with many sub-units contracting independently may not be splitting; they may have independent needs.
Improvements applied
- None structural. Filter
valor_inicial_compra > 0prevents division issues.
Known data quality issues
data_assinatura_contratocan be NULL for some older contracts.FORMAT_DATEon NULL returns NULL — it does NOT exclude those rows. Without a guard, all NULL-dated contracts from the same agency would be grouped together under a singleNULLmonth bucket, potentially producing a false flag if ≥3 of them are below threshold with combined value > threshold. Fixed (iteration 5): all three implementations now includeAND data_assinatura_contrato IS NOT NULLin the WHERE clause.valor_inicial_compravsvalor_final_compra: we usevalor_inicial_compraintentionally since splitting is defined by the contract as signed, not final.
Improvements applied (iteration 5)
- Added
AND data_assinatura_contrato IS NOT NULLto WHERE clause in all three implementations to prevent NULL-date contracts from being grouped into a spuriousmes = NULLbucket.
Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): scan-all.ts now includes id_orgao_superior in both SELECT and GROUP BY, matching index.ts and scan-suspicious.ts. Prevents theoretical merging of two distinct ministries sharing the same name.
US2 — Contract Concentration (contract_concentration)
Legal basis
No specific legal prohibition, but TCU and CGU audit methodology treat >40% share of a single agency's budget as a prima facie risk indicator requiring justification.
- Reference: CGU "Manual de Orientações para Análise de Risco em Compras Públicas" (2022), section 4.2.
Thresholds
- 40% share: empirical; above this, competition is functionally absent for that agency.
- R$ 50.000 minimum agency total: excludes micro-units (small local offices) where one purchase naturally dominates.
- R$ 10.000 minimum supplier spend (new, iteration 2): excludes trivial cases like a company with R$21k of a R$50k agency = 42% but both numbers are small.
False positive scenarios
- Specialized niches: A sole provider of a specialized service (e.g., judicial translation, specific medical device) may legitimately dominate one agency's procurement. No CNAE-based filter exists.
- Monopolistic markets: Some goods/services have few suppliers by nature (utilities, telecommunications infrastructure).
- Framework agreements: A single framework contract can make one supplier appear to dominate even if bidding was competitive at framework establishment.
Improvements applied
- Added
CONCENTRATION_MIN_SUPPLIER_SPEND = 10_000to batch query andscan-suspicious.ts(iteration 2). - Added
CONCENTRATION_MIN_SUPPLIER_SPENDfilter toindex.tspatternConcentrationHAVING clause (iteration 4 — was present in batch/scan-suspicious but missing from web UI).
Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): index.ts HAVING clause now includes supplier_spend >= CONCENTRATION_MIN_SUPPLIER_SPEND.
✅ Fixed (iteration 9): scan-all.ts and scan-suspicious.ts now group by (id_orgao_superior, nome_orgao_superior) in both the spend and ministry_total CTEs, joining on the composite key. All three implementations are consistent.
US3 — Inexigibility Recurrence (inexigibility_recurrence)
Legal basis
Inexigibilidade de licitação (Lei 14.133/2021 art. 74; Lei 8.666/93 art. 25) is legal when competition is technically impossible (e.g., exclusive supplier, artistic performances). Abuse occurs when agencies use inexigibilidade repeatedly for the same supplier to avoid competitive bidding.
- Reference: TCU Acórdão 1.793/2011: defines recurrent inexigibilidade as a risk indicator requiring documentation of technical exclusivity per contract.
Threshold: 3 contracts per managing unit
- Below 3: could be two legitimate sole-source needs in the same year.
- At 3+: pattern suggests systematic routing of contracts to avoid bidding.
False positive scenarios
- Legitimate exclusive suppliers: Publishers (publishing rights), performing arts venues, specialized IT vendors with proprietary systems legitimately receive many inexigibilidade contracts.
- Long-term technical partnerships: An agency may have a multi-year framework with an exclusive technical partner, generating many inexigibilidade contracts each year.
- Artistic/cultural organizations: Museums, theaters, and orchestras commonly contract artists via inexigibilidade.
Improvements applied (iteration 2)
- Batch + scan-suspicious: Now groups by
id_unidade_gestora(ID) +nome_unidade_gestora(name). Previously grouped by name only, risking merger of distinct units sharing a common name. - Batch + scan-suspicious: Added
valor_inicial_compra >= R$ 1.000filter. Micro-value contracts (< R$1k) rarely represent real abuse.
Improvements applied (iteration 4)
index.ts: AddedAND valor_inicial_compra >= @min_valueto WHERE clause ofpatternInexigibility. The web UI was missing this filter, causing micro-value contracts to inflate the count and trigger false flags.
Per-CNPJ vs batch consistency
✅ Fixed (iteration 4): all three implementations now filter valor_inicial_compra >= R$ 1.000 and group by id_unidade_gestora.
US4 — Single Bidder (single_bidder)
Legal basis
Not inherently illegal, but flagged by:
- Open Contracting Partnership "73 Red Flags" (2024), Flag #1: "Only one bid received."
- CGU "Programa de Fiscalização em Entes Federativos" 2023: single-bidder rate >30% is a tier-1 risk indicator.
Threshold: 2 occurrences
- Intentionally low. Even one solo-bid win warrants investigation context. Two is the minimum pattern.
False positive scenarios
- Specialized markets: Satellite communications, nuclear materials, specialized medical devices — few vendors exist globally.
- Geographic isolation: Remote municipalities with limited local suppliers naturally attract few bidders even for standard goods.
- Poorly timed notices: Short bid windows or holiday periods reduce participation regardless of market structure.
SQL robustness notes
- Per-CNPJ: uses
STARTS_WITH(REGEXP_REPLACE(...), @cnpj)— this matches any CNPJ where the base 8 digits match, including subsidiaries/branches. This is intentional: a corporate group that operates through multiple CNPJs should still surface. - Batch: uses
MAX(IF(vencedor AND LENGTH(...) = 14, SUBSTR(...), NULL))to extract the winner's CNPJ from theauction_statsCTE. TheLENGTH = 14guard in theIFcondition ensures CPF winners don't produce invalid 8-digit keys. If two CNPJ rows havevencedor=truefor the same auction (data quality issue),MAXpicks lexicographically last — acceptable for batch purposes.
Per-CNPJ vs batch consistency
✅ Fixed (iteration 8): batch now counts ALL participants (CPF + CNPJ) for total_bidders, matching per-CNPJ behavior. Previously, LENGTH = 14 excluded CPF individuals from the count, causing the batch to over-flag auctions where a CPF participant was present. The LENGTH = 14 guard is now applied only inside the winner_cnpj extraction IF() condition — not to the overall participant count.
US5 — Always Winner (always_winner)
Legal basis
Not illegal per se, but high win rates in competitive auctions indicate possible:
- Bid rigging (Lei 12.529/2011 art. 36, IV)
- Tailored specifications (Lei 14.133/2021 art. 9, I)
- Reference: OCDE "Guidelines for Fighting Bid Rigging in Public Procurement" (2021)
Thresholds
- ≥80% win rate (per-CNPJ, fixed) — raised from 60% to reduce false positives. Batch uses dynamic Q3 (empirically ≈100% in this dataset).
- ≥10 competitive participations — minimum sample for statistical significance. Aligns batch and per-CNPJ.
- Competitive auctions only (≥2 bidders) — critical to avoid overlap with US4.
Critical fix applied (iteration 2)
The per-CNPJ version was NOT filtering for competitive auctions before this iteration. A company that always won because it was always the only bidder would be flagged by both US4 (single_bidder) AND US5 (always_winner) — misleading double-counting. Fixed by adding a competitive_auctions CTE that filters COUNT(1) >= 2.
Win rate distribution note
The licitacao_participante dataset is strongly bimodal: approximately 33% of companies with ≥10 competitive participations have a perfect 100% win rate. The distribution does not follow a normal or uniform pattern. Q3 ≈ 1.0 regardless of the minimum sample cutoff (tested at 5, 10, 20). The dynamic Q3 threshold therefore flags only perfect-win companies — intentionally strict. This is documented in the spec.
Per-CNPJ vs batch consistency
✅ Fixed (iteration 2): both now filter for competitive auctions. Batch uses dynamic Q3; per-CNPJ uses fixed 0.80 threshold. The fixed threshold produces a slightly broader result set on the interactive page, which is acceptable — the batch feed should be conservative; per-CNPJ investigation mode can be more sensitive.
US6 — Amendment Inflation (amendment_inflation)
Legal basis
Lei 14.133/2021 art. 125 §1º: amendments may not increase the contract value by more than 25% of the original (for goods/services) or 50% (for construction). Inflation ≥ 1.25× means the contract reached or exceeded its legal ceiling.
Threshold: 1.25× (25% above original)
- Exactly the legal maximum. Contracts at 1.25× are at the legal limit; contracts above are potentially illegal unless specific circumstances apply (art. 125 §2º exceptions).
False positive scenarios
- Lawful exceptional amendments: Art. 125 §2º allows exceeding 25% for "additional work indispensable to the object's completion" — requires specific administrative justification.
- Construction contracts: Legal ceiling is 50% (not 25%). Our threshold of 1.25× flags construction contracts that are within the legal limit.
- Value adjustment clauses: Contracts with inflation adjustment clauses (INPC/IPCA) can legitimately reach or exceed 1.25× over multi-year terms without any amendment.
- Data entry errors: Some
valor_final_compravalues are clearly data quality issues (e.g., 100× original).
Improvements applied (iteration 3)
- Cap
inflation_ratioat 10× (AMENDMENT_MAX_INFLATION_RATIO = 10.0): ratios above this threshold are almost certainly data entry errors (e.g.,valor_final_compraentered in a different unit) and would distorttotal_excessreporting. Applied to all three implementations viaAND ... <= @max_ratiofilter in SQL. Applied inindex.ts,scan-all.ts,scan-suspicious.ts.
Schema verification: construction vs goods/services threshold
Lei 14.133/2021 art.125 §1º allows 50% amendments for engineering works vs 25% for goods/services.
Column verified (schema dump): contrato_compra has id_modalidade_licitacao (code) and modalidade_licitacao (name). However, this column encodes bidding modality (Concorrência, Pregão Eletrônico, Tomada de Preços, etc.) — not contract category (obras vs bens/serviços). There is no tipo_contrato or categoria column in the accessible schema.
Improvements applied (iteration 8): construction keyword detection
All three implementations now apply IF(REGEXP_CONTAINS(LOWER(IFNULL(objeto, '')), r'obra|constru|reform|engenhari|paviment|demoli'), 1.50, 1.25) to select the applicable legal threshold per contract. This reduces false positives for legitimate construction/engineering amendments that fall between 1.25× and 1.50×.
Keywords and rationale:
| Keyword | Matches | Rationale |
|---|---|---|
obra |
obra, obras | General construction work |
constru |
construção, construir | Building/construction |
reform |
reforma, reformar, reformas | Renovation/remodeling |
engenhari |
engenharia, engenheiro | Engineering services |
paviment |
pavimentação, pavimento | Road/floor paving |
demoli |
demolição, demolir | Demolition |
Known limitations: The objeto field is free-text entered by procurement officers. Some construction contracts may use generic descriptions ("serviços de manutenção") and be missed by this detection — applying the 1.25× threshold is safe for those (conservative false positive vs missed construction exemption).
Improvements applied (iteration 9): constructionCount field
AmendmentInflationFlag now includes constructionCount: the number of flagged contracts that matched the construction keywords and were therefore evaluated at the 1.50× threshold. The UI card shows this count with a tooltip explaining the applicable legal ceiling. This helps analysts distinguish "inflated by >25% on goods (potentially illegal)" from "inflated by >50% on obras (definitely exceeds even the construction ceiling)."
Per-CNPJ vs batch consistency
⚠️ Minor divergence (accepted): index.ts includes the aditivos CTE (zeroAmendmentCount) and constructionCount from is_construction. The batch scanners do NOT include these — contrato_termo_aditivo full scan is too expensive in batch, and constructionCount is per-row info not aggregable without the row-level data. Both fields are only available in the web UI's per-CNPJ output.
US7 — Newborn Company (newborn_company)
Legal basis
No specific prohibition, but:
- Lei 14.133/2021 art. 68, I: suppliers must demonstrate technical and economic qualification. Newly incorporated companies rarely can.
- CGU "Guia Prático de Análise de Empresas de Fachada" (2021): age < 6 months at contract signing is a tier-1 indicator of possible shell company.
Thresholds
- 180 days (6 months): practical minimum for legitimate operational readiness.
- R$ 50.000 minimum contract value: excludes training contracts and small acquisitions where new companies are common and low-risk.
False positive scenarios
- Spinoffs and restructurings: A newly incorporated CNPJ may be a restructured entity of an existing business with full operational capacity.
- Holding company structures: A holding created to receive a specific contract may have the technical capacity of its parent, not its founding date.
- Startups in innovation programs: Government startup accelerator programs (e.g., FAPESP TT, EMBRAPII) specifically contract very new companies.
data_inicio_atividadefrom establishments: The founding date comes frombr_me_cnpj.estabelecimentos, notempresas. Branches opened after the headquarter can make an established company appear "newborn" in a specific municipality.
Data quality note
data_inicio_atividade lives in br_me_cnpj.estabelecimentos, NOT empresas. The query uses MIN(est.data_inicio_atividade) across all establishments for the same cnpj_basico — this correctly picks the earliest known opening date, reducing the false positive of branches.
Per-CNPJ vs batch consistency
✅ Equivalent. Both use MIN(data_inicio_atividade) across establishments with ano=2023 AND mes=12.
⚠️ Known necessary full-table scan: The first_contract CTE in batchNewborn (scan-all.ts) intentionally omits an ano filter on contrato_compra:
FROM `basedosdados.br_cgu_licitacao_contrato.contrato_compra`
WHERE LENGTH(REGEXP_REPLACE(cpf_cnpj_contratado, r'\D', '')) = 14
AND valor_final_compra >= <MIN_VALUE>
GROUP BY cnpj_basico
This is a deliberate exception to the "zero full-table scans" rule from the spec. The pattern asks: "did this company win its very first contract within 180 days of founding?" Restricting to ano = ANO would miss the true first contract if it occurred in an earlier year — producing a false negative. The founding CTE correctly filters e.ano = ANO AND est.ano = ANO AND est.mes = 12. Only first_contract scans all years, but the LENGTH = 14 CPF exclusion and valor_final_compra >= R$ 50k filter significantly reduce bytes scanned.
US8 — Sudden Surge (sudden_surge)
Legal basis
Not illegal, but flagged by:
- UNODC "Guidebook on anti-corruption in public procurement" (2013): "Sudden large increase in a company's public contract revenue" is a tier-2 risk indicator.
- TCU Acórdão 2.622/2015: large YoY procurement increases without prior procurement history warrant scrutiny.
Thresholds
- 5× YoY growth: chosen to exclude normal business growth (2-3×) while flagging exponential jumps.
- R$ 1.000.000 minimum: a 5× jump from R$200k to R$1M is meaningful; from R$10k to R$50k is noise.
- 4-year lookback: captures context before the surge.
False positive scenarios
- Post-restructuring recovery: A company that was inactive for 2 years then resumed full operations would appear to surge.
- New framework agreements: Being added to a large framework agreement in year N can produce apparent surge with no underlying change in the company.
- Government budget cycles: Some sectors receive large multi-year contracts every 4 years (e.g., IT system replacements) creating apparent surges.
SQL robustness note
Both per-CNPJ and batch use prev_v > 0 guard to exclude zero→nonzero transitions (handled by US7 newborn_company instead). The batch uses LAG window function; per-CNPJ iterates over the history array client-side.
Consecutive-year guard (iteration 6): The spec says value[year_N] / value[year_N-1]. Without a guard, LAG compares any adjacent rows in sorted order — if a company had data in 2019 and 2023 (dormant 2020–2022), the comparison spans 4 years and produces a false surge. Fixed by:
scan-all.ts: addedLAG(ano)alongsideLAG(v)andWHERE ano - prev_ano = 1index.ts,scan-suspicious.ts: addedcurr.ano - prev.ano === 1to the JS loop condition
false positive (false negative from audit): The first false positive scenario (post-restructuring recovery) is now LESS likely to trigger since the consecutive-year guard would catch companies dormant for ≥1 year.
The per-CNPJ implementation reports only the first qualifying surge year (breaks on first hit). If a company surged twice, only the earlier event is shown. This is conservative.
Per-CNPJ vs batch consistency
✅ Equivalent. Batch uses SQL LAG; per-CNPJ uses JS loop. Both find the first qualifying year.
Infrastructure Issue: Cache Miss vs Stored Null
Bug 1: Cache Miss vs Stored Null (fixed iteration 6)
cache.ts getCache was returning null for both cache misses (file not found) and legitimately stored null values (pattern found nothing). Patterns US4–US8 and the company lookup all use null as their "nothing found" sentinel and check cached !== undefined to skip re-querying. With the old getCache returning null on miss, null !== undefined evaluated to true, causing the BigQuery query to be skipped permanently — US4–US8 would never execute on a CNPJ not yet in cache.
Fix: getCache now returns undefined on miss or expiry; returns T (including null) on a valid cache hit. The company-lookup caller that used !== null was updated to !== undefined.
Bug 2: Falsy cache check for array-returning patterns (fixed iteration 7)
US1, US2, US3, and runPatterns() in index.ts used if (cached) return cached to check for cache hits. An empty array [] is falsy in JavaScript — so a cached "no flags found" result (a real cache hit) was silently discarded, causing BigQuery to be re-queried on every subsequent call for clean CNPJs.
Affected: patternSplitContracts, patternConcentration, patternInexigibility, runPatterns.
Fix: changed all four to if (cached !== undefined) return cached. (US4–US8 already used this pattern since they cache null as "nothing found" — they were correct.)
Cross-Pattern Issues
Overlap between US4 and US5
- Before iteration 2: US5 per-CNPJ would flag solo-bid winners as "always winner", creating confusing double flags.
- After iteration 2: US5 filters to competitive auctions only. A pure solo-bid company gets US4 only; a company that wins competitive auctions at high rates gets US5 only; both behaviors together get both flags independently.
Overlap between US7 and US8
- A newborn company with a sudden surge would be flagged by both US7 (age at contract) and US8 (YoY growth). This is intentional and additive — both signals reinforce each other.
CNPJ matching strategy
All patterns use cnpj_basico (8-digit root) as the joining key. This means all branches and subsidiaries of a corporate group are attributed to the same cnpj_basico. This can create false positives for large corporations with many legitimate establishments (e.g., Correios, Petrobras) that naturally have contracts across many agencies.
Summary Table
| Pattern | FP Risk | Legal Basis | Fixes Applied |
|---|---|---|---|
| US1 Split | Medium — multi-item purchasing | Decreto 9.412/2018 / Decreto 11.871/2024 | NULL date guard; year-dependent threshold (R$17.600 ≤2023, R$57.912 2024+); falsy cache check fixed; batch GROUP BY now includes id_orgao_superior |
| US2 Concentration | Medium — specialized markets | CGU 2022 methodology | Added min supplier spend to all 3 implementations; falsy cache check fixed; all 3 now GROUP BY (id+name) — no ministry-name collision |
| US3 Inexigibility | High — legitimate exclusive suppliers | TCU Acórdão 1.793/2011 | Fixed grouping by ID; added min value to all 3 implementations; falsy cache check fixed |
| US4 Single Bidder | Medium — specialized/remote markets | OCP 2024 Flag #1 | cache.ts bug fixed (getCache null-vs-undefined); batch now counts all participants (CPF+CNPJ) — consistent with per-CNPJ |
| US5 Always Winner | Was HIGH (no competitive filter) → Now Medium | OCDE 2021 | Fixed: competitive auctions only; raised thresholds; cache.ts bug fixed |
| US6 Amendment | Medium — inflation clauses | Lei 14.133/2021 art.125 | Added 10× inflation cap; cache.ts bug fixed; construction keyword detection: 1.50× threshold for obras/etc.; constructionCount in UI flag |
| US7 Newborn | High — spinoffs, restructurings | CGU 2021 guide | cache.ts bug fixed (was never querying BigQuery on cache miss) |
| US8 Surge | Medium — framework agreements, budget cycles | UNODC 2013 | Added consecutive-year guard; cache.ts bug fixed |