AI / Research · 2024 – Present · PhD Researcher

RAG and Text-to-SQL over Open Government Data

PhD research on chat interfaces over municipal and federal datasets — Retrieval-Augmented Generation with verifiable citations, plus Text-to-SQL for structured public datasets.

RAG · Text-to-SQL · LLMs · Python · Open Gov Data

Prototypes against municipal open-data portals
Citation-tracked answers from verifiable sources
Evaluation sourced from real public-information requests

Context

Open government data is, in theory, the most accessible data in the world: every municipality, every federal agency, every public-services dashboard publishes some version of it. In practice it's locked behind portals, PDFs, CSVs with no schema documentation, and dashboards built for the agency that produced them — not the citizen who needs to ask a question.

The research question is straightforward: can a citizen with no SQL, no data tooling, and no patience for portal navigation get a credible answer to a real question about their city or their country, in their own words?

Role

PhD Researcher at Galileo University. I own the research framing, the prototype implementations, and the evaluation work.

What I'm building

Two complementary techniques layered into one chat surface:

Retrieval-Augmented Generation. Indexed corpora of municipal and federal documents — budgets, contracts, public-services reports — surfaced into the LLM context at query time, with citation tracking so a citizen can verify the answer against the source.
Text-to-SQL. For datasets that live in structured form (open data portals, agency databases), the chat layer translates the citizen's question into SQL against the public dataset and returns the answer with the underlying query exposed for audit.
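
A minimal sketch of the Text-to-SQL path, to make "the underlying query exposed for audit" concrete. It is an illustration under stated assumptions, not the prototype's code: `translate_to_sql` is a hypothetical stand-in for the LLM call, and the `permits` table and its rows are invented sample data held in an in-memory SQLite database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class AuditedAnswer:
    question: str
    sql: str    # returned alongside the rows so the reader can audit the query
    rows: list


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM translation step (hard-coded here)."""
    return "SELECT COUNT(*) FROM permits WHERE district = 4"


def answer(conn: sqlite3.Connection, question: str) -> AuditedAnswer:
    sql = translate_to_sql(question)
    # Assumption: the chat layer only ever reads, so reject anything but SELECT.
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")
    rows = conn.execute(sql).fetchall()
    return AuditedAnswer(question=question, sql=sql, rows=rows)


# Invented sample data standing in for a public dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (id INTEGER, district INTEGER)")
conn.executemany("INSERT INTO permits VALUES (?, ?)", [(1, 4), (2, 4), (3, 7)])

result = answer(conn, "How many permits were issued in District 4?")
print(result.sql)
print(result.rows)
```

Returning the SQL string with the rows, rather than the rows alone, is what lets a reader rerun the query against the public dataset.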

The two paths cover different question shapes — narrative ("what does the city say about transit?") vs. structured ("how many permits were issued in District 4 last quarter?") — and the same chat surface routes between them.
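
One way the routing step could look, as a rough sketch only. The keyword heuristic below is an assumption made for illustration; the source does not say how the prototype routes, and a real router might use a classifier or the LLM itself. The cue list and the `route` function are hypothetical.

```python
import re

# Hypothetical cues that suggest a countable, structured question.
STRUCTURED_CUES = re.compile(
    r"\b(how many|how much|total|average|count|last quarter|last year)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return which path should handle the question: 'text_to_sql' or 'rag'."""
    return "text_to_sql" if STRUCTURED_CUES.search(question) else "rag"


narrative = route("What does the city say about transit?")
structured = route("How many permits were issued in District 4 last quarter?")
print(narrative, structured)
```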

Research decisions worth writing down

Citations are not optional. A government-data chatbot that answers without verifiable provenance is worse than no chatbot at all — it manufactures confidence. Every answer in the prototype links back to the source row, page, or document.
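
A sketch of what "citations are not optional" can mean at the data-structure level: every retrieved passage carries its citation, and the answer function declines to answer when nothing was retrieved. The keyword-overlap retrieval is a toy stand-in for a real index, and the documents, locators, and passage texts are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    document: str
    locator: str  # page, row, or section within the document


@dataclass(frozen=True)
class Passage:
    text: str
    citation: Citation


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: list, k: int = 2) -> list:
    """Toy keyword-overlap ranking; a real system would query an index."""
    scored = [(len(tokens(question) & tokens(p.text)), p) for p in corpus]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [p for _, p in ranked[:k]]


def answer_with_citations(question: str, corpus: list) -> dict:
    passages = retrieve(question, corpus)
    if not passages:
        # No source, no answer: decline rather than reply without provenance.
        return {"answer": None, "citations": []}
    return {
        "answer": " ".join(p.text for p in passages),
        "citations": [p.citation for p in passages],
    }


# Invented sample corpus.
corpus = [
    Passage("The transit budget increased for bus routes.", Citation("budget_2024.pdf", "page 12")),
    Passage("Park maintenance contracts were renewed.", Citation("contracts.csv", "row 48")),
]

cited = answer_with_citations("What does the city say about transit?", corpus)
uncited = answer_with_citations("Weather on Mars", corpus)
print(cited["citations"])
print(uncited["answer"])
```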

Text-to-SQL needs schema documentation that doesn't exist. Most open data portals publish CSVs with no schema, no column descriptions, no data dictionary. A real Text-to-SQL stack has to either reconstruct that documentation from the data itself or sit on top of curated portals where it does exist.
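
As an illustration of reconstructing documentation from the data itself, the sketch below drafts a rough data dictionary from a CSV using only the standard library: a guessed type, a distinct-value count, and an empty-cell count per column. The sample CSV is invented, the type guesses are deliberately naive (dates come out as text), and it assumes every row has a value or an empty string for each column.

```python
import csv
import io


def infer_type(values: list) -> str:
    """Guess a column type by attempting to cast every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for name, cast in (("integer", int), ("real", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"


def draft_data_dictionary(csv_text: str) -> dict:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {
        col: {
            "type": infer_type([r[col] for r in rows]),
            "distinct": len({r[col] for r in rows}),
            "empty": sum(1 for r in rows if not r[col].strip()),
        }
        for col in columns
    }


# Invented sample data standing in for an undocumented portal export.
sample = (
    "permit_id,district,fee,issued\n"
    "1,4,120.50,2024-01-03\n"
    "2,4,,2024-02-11\n"
    "3,7,80,2024-02-15\n"
)

dictionary = draft_data_dictionary(sample)
for column, profile in dictionary.items():
    print(column, profile)
```

A profile like this is only a starting point: it can tell a model that `district` is a low-cardinality integer, but not what a district is, which still has to come from the publishing agency or a human reviewer.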

Non-technical first. The target user is a citizen, not an analyst. The evaluation framework reflects that — questions are sourced from real public-information requests, not from SQL benchmarks.

Status

Research in progress. Prototypes have been built against municipal datasets; the Text-to-SQL evaluation work is the current focus. Publications and dissertation work are ongoing.