Using ChatGPT as THE Interface to a BI System

By Gary Angel


May 7, 2024

ChatGPT and LLMs as the Front-End to a BI System

We started our ChatGPT/LLM integration process with a SQL generator and then followed that up with an insights generator based on specific reports a user pulled in the UI of our platform. Getting these to run well turned out to be a decent amount of work, but each ended up providing something of value. On the other hand, I don’t think anyone would look at the results from those first two efforts and consider them revolutionary. They are nice extensions to our core platform, but no more than that. We’ve added plenty of other features that had more value. The real promise of ChatGPT in BI isn’t to serve as a SQL generator or even an insights engine; it’s to serve as the primary interface to your data. If users still have to build tabular reports and charts or try to get insight from dashboards, then LLMs haven’t really transformed much of anything. So, after building out those first two phases and learning a fair amount, we tackled this more ambitious goal – making ChatGPT the primary interface to our product. To paraphrase Bill Murray in Groundhog Day, we didn’t want ChatGPT to be an interface to our people measurement data, we wanted it to be THE interface.





To do that, we needed to train ChatGPT on our data. OpenAI does support that. You can point ChatGPT to a document corpus and have it train on your materials. This is the process you go through if you are building, for example, a custom chatbot. It’s a way to give ChatGPT insight into your vocabulary and your specific data. We’d been using the context for this in our previous efforts, but while ChatGPT’s context has expanded considerably, you can’t fit an entire database into it. So our first step was to take our data and create a document template for it that could then be ingested for training.


Naturally, we used everything we had discovered in the first two rounds about context. We loaded in our metric explanations and all sorts of usage examples in templatized form. Then we created a batch process for filling in the templatized data. There are a lot of different ways to do this. Our data isn’t stored as rows in a database – it’s event-level data. Loading it that way into an LLM wouldn’t make a lot of sense (I think). Instead, for our first approach, we borrowed what we do when we create data feeds.


We have what we call the “OLAP” feed which contains every collected metric at a low time granularity for every defined map region in a space. This seemed like a good level for insight generation and it’s the kind of feed that can support a lot of traditional BI analytics. Of course, it didn’t hurt that we already had a nice process for generating this data from the event-level and pushing it into a data lake. We just had to modify that process to create template documents.
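To make the idea concrete, here is a minimal sketch of turning feed rows into a templated text document. It is not our production code: the `OlapRow` fields, the `ROW_TEMPLATE` wording, and the `render_document` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from string import Template


# Hypothetical shape of one OLAP-feed row; the field names are illustrative.
@dataclass
class OlapRow:
    region: str
    period_start: str  # start of the time bucket, e.g. an ISO timestamp
    metric: str
    value: float


# Hypothetical sentence template for a single row.
ROW_TEMPLATE = Template(
    "In region $region, during the period starting $period_start, "
    "the metric $metric was $value."
)


def render_document(rows: list[OlapRow], header: str) -> str:
    """Render a batch of feed rows as one plain-text document.

    The header is where metric explanations and usage notes would go.
    """
    lines = [header]
    for row in rows:
        lines.append(ROW_TEMPLATE.substitute(
            region=row.region,
            period_start=row.period_start,
            metric=row.metric,
            value=f"{row.value:g}",
        ))
    return "\n".join(lines)
```

A batch job along these lines would call `render_document` once per region or time window and write each result out as a separate document.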


Our second approach was different and I think more creative. Forget about all the metrics and dimensions and stuff – our data is about customer journeys. So instead of representing the data as a table, we turned each customer journey into a short narrative.


“Person 12345 entered Store X at 12.15 and moved from Entrance to Common. They spent 2 seconds in Entrance. In Common they spent Y seconds. They were in motion for x% of that time with an average velocity of Z. Then they moved from Common to…”


And so on. And yes, we really do have data this detailed (and more). This journey representation is voluminous, but it does capture a more textual approach to the data and it can help ChatGPT answer pathing questions of considerable complexity.
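A rough sketch of how a journey could be rendered as narrative text looks something like this. The `Dwell` record, its fields, and the `journey_narrative` function are hypothetical stand-ins, and the sentence wording only approximates the example above.

```python
from dataclasses import dataclass


# Hypothetical record for one stop on a journey; field names are assumptions.
@dataclass
class Dwell:
    region: str
    seconds: float
    pct_in_motion: float  # share of the dwell spent moving, 0-100
    avg_velocity: float   # units are whatever the source data uses


def journey_narrative(person_id: str, store: str, entered_at: str,
                      dwells: list[Dwell]) -> str:
    """Turn an ordered list of region dwells into a short narrative."""
    if not dwells:
        return f"Person {person_id} entered {store} at {entered_at}."
    sentences = [
        f"Person {person_id} entered {store} at {entered_at} in {dwells[0].region}."
    ]
    for i, d in enumerate(dwells):
        sentences.append(
            f"They spent {d.seconds:g} seconds in {d.region}, in motion for "
            f"{d.pct_in_motion:g}% of that time with an average velocity of "
            f"{d.avg_velocity:g}."
        )
        # Describe the transition to the next region, if there is one.
        if i + 1 < len(dwells):
            sentences.append(
                f"Then they moved from {d.region} to {dwells[i + 1].region}."
            )
    return " ".join(sentences)
```

Each journey becomes one short paragraph, which is why this representation gets voluminous quickly.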


We decided to test with a small subset of our data – so we started with a single month for a single client. We figured if that worked well, it would be easy enough to expand.



Issues with Training


I’ll say upfront that training ChatGPT was a frustrating process that has not yet lived up to our expectations or hopes. We found that there were four significant issues with custom training on ChatGPT. Maybe others have solved these problems or maybe their problem is different, but we found that getting ChatGPT to train well on our kind of data was really challenging. The problems we struggled with were: accuracy, emphasis, cost and time to train.





Accuracy


This was hard. Getting ChatGPT to do a reasonable job answering detailed data questions proved much harder with training than with context. What worked in context helped in training, but ChatGPT seemed to constantly over-focus on early training and miss huge chunks of data. It was hard to get it to differentiate data appropriately, and many of the tricks we used effectively in context simply didn’t seem to register in training. We never fully defeated this problem, though we made some progress on it. Inaccuracy was particularly telling with our OLAP approach to training. For journey narratives, ChatGPT wasn’t great about generating metrics, but it was pretty good about answering questions about common paths and whether people who did X also did Y.





Emphasis


One aspect of training is that your corpus is incorporated into ChatGPT’s general knowledge system. That means that ChatGPT can bring all of its vast information base to bear on questions. That can be great because it can use what it knows about stores or products to add insight to the data. But it can also mean that the answers it provides spin off in directions that aren’t useful or reflective of the actual data. We found that getting ChatGPT to focus appropriately on our data was hard and that what worked in the context to do this often didn’t when we just trained.





Cost


While queries to ChatGPT are quite inexpensive, training ChatGPT isn’t. Now, to be fair, OpenAI doesn’t think about training as something you will be doing every day. Most companies will load a corpus of documents (about their product, for example) and will only have to update that corpus and re-train periodically. But a BI solution doesn’t work that way. We get data in real time and our slowest report cycle is daily. We avoided this problem in our testing by just training against a static month, but it was obvious that if we wanted ChatGPT to become the interface to the tool, it was going to be quite expensive to train on a daily basis and almost impossible to train more often than that.



Time to Train


In a closely related problem, training ChatGPT on your custom corpus is time-consuming. This may be a matter of job prioritization and, again, I understand why that is. Our use case isn’t the standard one. But our training cycle times for even a month of data for a single client were often in the many-minutes-to-hours range. That’s okay for testing, but it looked unsustainable for production.



Where we came out on training


It was clear to us that even if we solved our accuracy and emphasis problems (which we mostly didn’t), this approach wasn’t sustainable for a production system on cost and time alone. To make this approach successful, we were going to have to abandon ChatGPT and work with something like LLaMA. That’s disappointing for a variety of reasons, but it’s where we came out.


But before we gave up entirely, we did experiment with a compromise solution. We knew we couldn’t use context to fully support product queries, but it seemed possible that with ChatGPT’s greatly expanded context we could use it to support queries about what’s new / changed outside of the limited scope of highlighting report pulls. Call this “Ambitious Plan B.” Or maybe “Ambition Light.”





We’d successfully used large contexts to improve our ChatGPT results in SQL generation and highlighting, so we had a pretty good idea about what context could do and how to do it. But even though ChatGPT had greatly expanded the size of the context it’s not like we could stuff our entire database into it or even all the data for a single client. Inevitably, then, we had to give up on the idea of using ChatGPT as a primary interface to the data.


Nevertheless, the most common ongoing usage of the tool is for people to check in on how their data has changed. We knew that using context, not training, would limit how much data ChatGPT could have, so we decided to focus it on highlighting changes in the most recent data. This isn’t exactly an alerting system – more a “What’s New” approach. Our thought was that we could create a ChatGPT view when users logged in that pointed out interesting changes in the data.


For the most part, this approach worked and all of the same tricks we applied in creating context for highlighting apply here as well. We did find that our costs became a little more significant with this approach because we had to pass in quite a large context with each login. We also found that the lag time in generating responses made it trickier to integrate into the user flow. We would have liked to just pop a window post login, but waiting for a query to complete made the login seem frustratingly slow.


We did find a few new techniques that significantly improved the quality of ChatGPT content in generating a “What’s New.”



Measures of Significance


Left on its own, ChatGPT doesn’t know or understand much about whether changes are statistically significant. It tends to focus on the raw size of changes when calling out findings, without accounting for the underlying size of the data or its natural variation. This meant that most of the “What’s New” or “Different” questions we asked returned stupid answers. To improve this, we pre-calculated measures of variation and gave ChatGPT the raw numbers and the standard deviation of the changes along with text explaining what levels of standard deviation were significant. We also removed most very small rows from the context. This has some risk, but since our goal wasn’t to support general queries but to focus on changes, eliminating small quantities has two big benefits. It keeps ChatGPT from calling out significant changes when, for example, a store section goes from 5 visits to 8 visits, and it considerably reduces the size of the context.
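Here is a minimal sketch of that pre-calculation, under stated assumptions. The `MIN_VISITS` cutoff, the two-standard-deviation wording in `GUIDANCE`, and the `change_rows` helper are illustrative choices, not the values or code we actually used.

```python
from statistics import stdev

# Assumed cutoff for "very small" rows; the right value depends on the data.
MIN_VISITS = 50

# Assumed wording for the explanatory text that accompanies the numbers.
GUIDANCE = (
    "Treat a change as significant only when its absolute value is at least "
    "2 times the listed standard deviation of changes."
)


def change_rows(history: dict[str, list[int]],
                min_visits: int = MIN_VISITS) -> list[dict]:
    """Build context rows from per-section visit counts (oldest first).

    The standard deviation is taken over prior period-over-period changes,
    excluding the latest change that is being evaluated.
    """
    rows = []
    for section, counts in history.items():
        if len(counts) < 4:
            continue  # need at least two prior changes to compute a stddev
        previous, current = counts[-2], counts[-1]
        if max(previous, current) < min_visits:
            continue  # drop very small rows, e.g. 5 visits going to 8
        changes = [b - a for a, b in zip(counts, counts[1:])]
        rows.append({
            "section": section,
            "previous": previous,
            "current": current,
            "change": current - previous,
            "change_stddev": round(stdev(changes[:-1]), 1),
        })
    return rows
```

The rows and the `GUIDANCE` text would then both be placed in the context, so the model sees the raw numbers next to a rule for reading them.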



Focus on Change


We found that it worked better to put the actual change data in the context rather than the underlying period data. In other words, instead of this:


Store Section   June Visits   July Visits


We provided something like this:


Store Section   Visit Change   Visit Percentage Change   Visit Change Stddev


The more specific your data is to the purpose, the better ChatGPT tends to do with it when drawing out insights.
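As a sketch of the conversion from period data to change data, something like the following would work. The `change_table` function and the tab-separated layout are assumptions for illustration, and the standard deviation is taken as a pre-computed input.

```python
def change_table(rows: list[tuple[str, int, int, float]]) -> str:
    """Render (section, june_visits, july_visits, change_stddev) tuples
    as a tab-separated change table suitable for pasting into a context."""
    header = (
        "Store Section\tVisit Change\tVisit Percentage Change\t"
        "Visit Change Stddev"
    )
    lines = [header]
    for section, june, july, change_stddev in rows:
        change = july - june
        # Guard against a zero baseline, where a percentage is undefined.
        pct = f"{change / june:+.1%}" if june else "n/a"
        lines.append(f"{section}\t{change:+d}\t{pct}\t{change_stddev:.1f}")
    return "\n".join(lines)
```

Signed values (`+50`, `+25.0%`) make the direction of each change explicit, so the model does not have to infer it from two period columns.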



Define Interesting


It helps to put some thought into what’s interesting about a change. Without guidance, ChatGPT will tend to pick the largest change or the largest change adjusted for variability. That’s okay, but it’s rarely optimal. For a truly general purpose system, there may not be much you can do to improve this behavior. But for a purpose built BI system, it’s possible to give ChatGPT more guidance about what interesting is.


We found that we could improve output by adding vertical specific guidance to our context. In general, we did this by providing a hierarchy of interest – telling ChatGPT to focus on significant variations in A, then B, then C.


For retail stores, for example, this hierarchy of interest might be conversion rates, relative usage of merchandising areas, and changes in shopper to associate ratios. For an airport, the hierarchy of interest might be changes in usage by day-time part, changes in wait-time by queue, and changes in throughput by queue.
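One way to express a hierarchy of interest in the context is sketched below. The `INTEREST_HIERARCHY` dictionary, the `interest_guidance` function, and the exact instruction wording are hypothetical; only the priorities themselves come from the examples above.

```python
# Ordered priorities per vertical, taken from the examples in the text.
INTEREST_HIERARCHY = {
    "retail": [
        "conversion rates",
        "relative usage of merchandising areas",
        "shopper-to-associate ratios",
    ],
    "airport": [
        "usage by day-time part",
        "wait time by queue",
        "throughput by queue",
    ],
}


def interest_guidance(vertical: str) -> str:
    """Build the context paragraph that ranks what counts as interesting."""
    priorities = INTEREST_HIERARCHY[vertical]
    ranked = "\n".join(
        f"{i}. Changes in {p}" for i, p in enumerate(priorities, start=1)
    )
    return (
        "When deciding which changes to call out, prefer significant "
        "variations in this order of priority:\n" + ranked
    )
```

Keeping the hierarchy as data rather than hand-written prompt text makes it straightforward to add a new vertical without touching the rest of the context.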




Using ChatGPT with training as a general interface to our data didn’t turn out to be effective. That doesn’t mean it couldn’t be – perhaps we just didn’t crack the code. But based on our experience, it would be challenging and operationally difficult to build a robust ChatGPT interface into custom data that changes daily. Even if you (or OpenAI) solve the accuracy issues, the cost and time cycles look problematic. Our fallback of using context and “What’s New” to limit the process worked but is really just a small addition to what we’d already done. I described it as “ambition light” but it’s really more like ambition very light.


Our next step is to try to use an open-source LLM to do this – training intensively on our data. That’s a bigger project though, and I don’t expect to have an update on it for a while!
