Florian Valeye’s Post


Staff Data Engineer at Back Market

Data is the currency of my decisions. 📊 By analyzing it through dashboards built on top of trusted datasets, I get visually compelling answers to my predefined questions:
• What are the top-selling products this month?
• What is the number of orders today?
• How many concurrent requests is our website serving?

However, adapting a dashboard often means adding filters or drilling deeper into specific sections, and sometimes rebuilding the dashboard entirely. My dashboarding tool and SQL are my dear old friends on this journey! 🤖

I dreamt of a Q&A data assistant that adapts to my data questions in near real time. That's why I started experimenting with the most capable Large Language Models, behind a UI that gives me my particular answers! 📚

Here are my findings from designing a "Data Insighter": Retrieval-Augmented Generation on datasets!

Before starting, dedicate many iterations to improving all available documentation: KPI definitions, key concepts and shared knowledge among data users, table definitions, and column definitions. Also consider security access and all the underlying policies on your datasets and infrastructure. And of course, the LLM is the cherry on the cake; in other words, don't put a cherry on top of a fire if you would like to enjoy it! 
🍒

➤ Step 1: Help your highly technical assistant choose the best tables
• Provide general documentation with concepts and guidance on the technical infrastructure
• Prompt the LLM to act as an expert on this technical infrastructure
• Provide a list of table schemas, table descriptions, and associations
• Ask it to select the most relevant tables for your question

➤ Step 2: Provide more details on the selected tables
• Provide an extract of the selected tables, their lineage, and the column comments
• Ask it to generate the perfect SQL query for the question, limited in rows and SELECT-only
• Use the markdown formatting to extract the SQL from the explanations

➤ Step 3: Failing is learning
• Run the generated SQL query on your infrastructure
• If it fails, retry, giving the LLM the opportunity to fix it!
• The answer is nearly there, in Pandas DataFrame format

➤ Step 4: Let's recap!
• Summarize and present the result in clean markdown
• Displaying a graph is an option

➤ Step 5: Measure and iterate!
• Run simple assertion tests on questions, selected tables, and SQL comparisons
• Understand the limitations and restrict the scope of the data domains your queries cover

That's it for now. Even if it's not entirely working yet, drafting SQL and good documentation are always a productivity boost for everyone! Please keep in mind that you'll need to offer this service to data experts first, so there is an SQL verification step before wrong insights are shared at scale. 🥁

#dataengineering #ai #largelanguagemodel #datainsights #kpis #datavisualization
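Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not the author's implementation: `ask_llm` is a stub that returns a canned markdown reply (a real version would call your model's API with the schema context from Step 1), and the in-memory `orders` table exists only so the loop runs end to end.

```python
import re
import sqlite3
import pandas as pd

# Stubbed LLM call: a real implementation would hit your model's API.
# The canned reply mimics Step 2's "SQL inside a markdown fence".
def ask_llm(prompt: str) -> str:
    return (
        "Here is the query:\n```sql\n"
        "SELECT product, SUM(qty) AS sold FROM orders "
        "GROUP BY product ORDER BY sold DESC LIMIT 10\n```"
    )

def extract_sql(reply: str) -> str:
    """Step 2: pull the SQL out of the markdown explanation."""
    match = re.search(r"```sql\s*(.*?)```", reply, re.DOTALL)
    sql = (match.group(1) if match else reply).strip()
    # SELECT-only guard, as the post recommends.
    if not sql.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return sql

def run_with_retries(conn, question: str, max_retries: int = 2) -> pd.DataFrame:
    """Step 3: execute the generated SQL, feeding errors back to the LLM."""
    prompt = f"Write a SQL query answering: {question}"
    for _ in range(max_retries + 1):
        sql = extract_sql(ask_llm(prompt))
        try:
            return pd.read_sql_query(sql, conn)  # answer as a DataFrame
        except Exception as err:
            # Give the LLM the opportunity to fix its own query.
            prompt = f"The query failed with: {err}\nFix it.\nQuestion: {question}"
    raise RuntimeError("LLM could not produce a working query")

# Tiny in-memory dataset so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE orders (product TEXT, qty INTEGER);"
    "INSERT INTO orders VALUES ('phone', 3), ('laptop', 1), ('phone', 2);"
)
df = run_with_retries(conn, "What are the top-selling products this month?")
print(df)
```

The retry loop is the key design choice: the database error message becomes part of the next prompt, so the model can repair its own query instead of the user debugging it.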

  • Text to SQL with LLMs
Denny Lee

data engineering and analytics geek (we’re hiring)

4mo

This is really cool! Loving your push and advocacy of generative AI.

SOUMYA ELAYEDATH HARIDAS

Data Analyst | Python | SQL | R | POWER BI | TABLEAU | Data Visualization Specialist | Transforming Data into Strategic Insights

4mo

Highlighting the pivotal role of data through this visualization is incredibly compelling! It's a powerful reminder of how data empowers decision-making and shapes our understanding of the world.

Masood Joukar

Data & AI Advisory Architect

4mo

Very cool, Florian. In my opinion, as companies move more and more toward being data- and AI-driven, such use cases are by far more interesting and bring companies more value than reporting, for the reason you already mentioned: "answering questions in near real-time."
