RAG and Text-to-SQL over Open Government Data
PhD research on chat interfaces over municipal and federal datasets — Retrieval-Augmented Generation with verifiable citations, plus Text-to-SQL for structured public datasets.
- Prototypes against municipal open-data portals
- Citation-tracked answers from verifiable sources
- Evaluation sourced from real public-information requests
Context
Open government data is, in theory, the most accessible data in the world: every municipality, every federal agency, every public-services dashboard publishes some version of it. In practice it's locked behind portals, PDFs, CSVs with no schema documentation, and dashboards built for the agency that produced them — not the citizen who needs to ask a question.
The research question is straightforward: can a citizen with no SQL, no data tooling, and no patience for portal navigation get a credible answer to a real question about their city or their country, in their own words?
Role
PhD Researcher at Galileo University. I own the research framing, the prototype implementations, and the evaluation work.
What I'm building
Two complementary techniques layered into one chat surface:
- Retrieval-Augmented Generation. Indexed corpora of municipal and federal documents — budgets, contracts, public-services reports — surfaced into the LLM context at query time, with citation tracking so a citizen can verify the answer against the source.
- Text-to-SQL. For datasets that live in structured form (open-data portals, agency databases), the chat layer translates the citizen's question into SQL against the public dataset and returns the answer with the underlying query exposed for audit.
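A minimal sketch of the Text-to-SQL path, assuming a SQLite copy of a public dataset. The `permits` table, its columns, and the `generate_sql` stub (standing in for the model call) are hypothetical and are not taken from the prototype. What the sketch is meant to show is the return shape: the answer travels together with the exact query that produced it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class AuditedAnswer:
    question: str
    sql: str    # exact query executed, shown to the user for audit
    rows: list  # raw result rows


def generate_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM translation step."""
    return (
        "SELECT COUNT(*) FROM permits "
        "WHERE district = 4 AND issued_on >= '2024-01-01' AND issued_on < '2024-04-01'"
    )


def is_read_only(sql: str) -> bool:
    """Crude guard (an assumption of this sketch): accept a single SELECT only."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(conn: sqlite3.Connection, question: str) -> AuditedAnswer:
    sql = generate_sql(question)
    if not is_read_only(sql):
        raise ValueError(f"refusing non-SELECT query: {sql!r}")
    rows = conn.execute(sql).fetchall()
    return AuditedAnswer(question=question, sql=sql, rows=rows)


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory dataset so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE permits (id INTEGER, district INTEGER, issued_on TEXT)")
    conn.executemany(
        "INSERT INTO permits VALUES (?, ?, ?)",
        [(1, 4, "2024-01-15"), (2, 4, "2024-03-02"), (3, 2, "2024-02-10"), (4, 4, "2024-05-01")],
    )
    return conn
```

Calling `answer(demo_connection(), "...")` returns an `AuditedAnswer` whose `sql` field can be rendered next to the result, so a reader who distrusts the number can rerun the query against the portal's own copy of the data.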
The two paths cover different question shapes — narrative ("what does the city say about transit?") vs. structured ("how many permits were issued in District 4 last quarter?") — and the same chat surface routes between them.
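One way to picture the routing step is a small classifier sitting in front of the two paths. The keyword heuristic below is purely illustrative and an assumption of this sketch; a production router would more plausibly rely on a model or on the dataset catalog.

```python
import re
from enum import Enum


class Route(Enum):
    RAG = "rag"
    TEXT_TO_SQL = "text_to_sql"


# Hypothetical cue list: phrasings that usually signal a countable, structured question.
STRUCTURED_CUES = re.compile(
    r"\b(how many|how much|total|average|count|per (district|year|month)|last quarter)\b",
    re.IGNORECASE,
)


def route(question: str) -> Route:
    """Send aggregate-style questions to Text-to-SQL, everything else to RAG."""
    if STRUCTURED_CUES.search(question):
        return Route.TEXT_TO_SQL
    return Route.RAG
```

With this heuristic, the transit question from the text falls through to `Route.RAG`, while the permits question matches "how many" and goes to `Route.TEXT_TO_SQL`.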
Research decisions worth writing down
Citations are not optional. A government-data chatbot that answers without verifiable provenance is worse than no chatbot at all — it manufactures confidence. Every answer in the prototype links back to the source row, page, or document.
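As a sketch of what "no answer without provenance" could look like in code, the structure below refuses to construct an answer that carries no citations. The class and field names (`Citation`, `CitedAnswer`, `document`, `page`, `row`, `url`) are assumptions made for illustration, not the prototype's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    document: str                # title or identifier of the source document
    url: str                     # where a citizen can open the source
    page: Optional[int] = None   # page number, for PDFs and reports
    row: Optional[int] = None    # row number, for tabular sources

    def label(self) -> str:
        parts = [self.document]
        if self.page is not None:
            parts.append(f"p. {self.page}")
        if self.row is not None:
            parts.append(f"row {self.row}")
        return ", ".join(parts)


@dataclass(frozen=True)
class CitedAnswer:
    text: str
    citations: tuple  # tuple of Citation objects

    def __post_init__(self) -> None:
        # Enforce the rule at construction time: no provenance, no answer.
        if not self.citations:
            raise ValueError("an answer must cite at least one source")

    def render(self) -> str:
        refs = "\n".join(
            f"[{i}] {c.label()} <{c.url}>" for i, c in enumerate(self.citations, 1)
        )
        return f"{self.text}\n\nSources:\n{refs}"
```

Putting the check in the constructor rather than in the display layer means an uncited answer cannot reach the user by accident: the pipeline fails loudly before anything is shown.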
Text-to-SQL needs schema documentation that doesn't exist. Most open-data portals publish CSVs with no schema, no column descriptions, no data dictionary. A real Text-to-SQL stack has to either reconstruct that documentation from the data itself or sit on top of curated portals where it does exist.
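Reconstructing documentation from the data itself can start with plain column profiling. The sketch below uses only the Python standard library; the type-guessing rules and the function names are assumptions for illustration, and the resulting profile is the kind of summary one might hand to the model in place of a missing data dictionary.

```python
import csv
import io


def guess_type(values: list) -> str:
    """Guess a SQL-like type from non-empty string samples."""
    def all_parse(cast) -> bool:
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "UNKNOWN"
    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "REAL"
    return "TEXT"


def profile_csv(text: str, sample_size: int = 3) -> list:
    """Return one profile dict per column: name, guessed type, null count, samples."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profiles = []
    for name in reader.fieldnames or []:
        raw = [row.get(name) for row in rows]
        # Treat missing and blank cells as nulls.
        present = [v.strip() for v in raw if v is not None and v.strip()]
        profiles.append({
            "column": name,
            "type": guess_type(present),
            "nulls": len(raw) - len(present),
            "samples": sorted(set(present))[:sample_size],
        })
    return profiles
```

A profile like this says nothing about what a column means, only what it contains, so it is a starting point for a reconstructed data dictionary rather than a substitute for one.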
Non-technical first. The target user is a citizen, not an analyst. The evaluation framework reflects that — questions are sourced from real public-information requests, not from SQL benchmarks.
Status
Research in progress. Prototypes have been built against municipal datasets; the Text-to-SQL evaluation work is the current focus. Publications and dissertation work are ongoing.