ChatGPT and Highlight Generation: Improving a Master of the Obvious
By Gary Angel
May 3, 2024
While SQL generation was a good place to start with our ChatGPT integration efforts, it was inherently limited as a use-case. Most people who need data have enough familiarity with SQL to get it (and most people who don’t know SQL don’t have much use for the underlying data). A tool that creates SQL is only going to enable a small number of additional users. To create more value from ChatGPT, we wanted to use it as a kind of co-pilot and pseudo-analyst inside our report generation. Our thought was to integrate ChatGPT into every report pull by passing it the underlying data – either when the user requested it or perhaps with each pull.
We figured the most logical place to start was with our report builder module – where users select dimensions and metrics to build basic report tables.
The interface supports a bunch of different metrics and reports can be trended or grouped in a lot of different ways. You can also apply segmentation to filter reports by specific user types.
This kind of table is easy to give to ChatGPT and ask it for insights. We started (and, as it turned out, ended) with a pull mechanism – a button for generating insights. I’ll explain why later, but we went this route first mostly because it was easier. We added the Insights button to the left nav strip, and when a user clicks it, we pass the current data table to ChatGPT and then pop the results back into a frame on the right side of the table:
That window can be minimized or restored, and it keeps a record of the data sent and the insights generated:
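Under the hood, the mechanics are simple: when the button is clicked, the front end posts the current table to our back end, which wraps it in a prompt and calls the ChatGPT API. Here’s a minimal sketch of that call, assuming the official OpenAI Python client; the function name, model choice, and prompt wording are illustrative, not our production code.

import json
from openai import OpenAI   # official OpenAI Python client (assumed available)

client = OpenAI()           # reads OPENAI_API_KEY from the environment

def generate_highlights(report_table, context_text):
    # context_text carries the metric definitions and analytic guidance
    # described below; the user message is just the table on screen.
    messages = [
        {"role": "system", "content": context_text},
        {"role": "user",
         "content": "Generate highlights and analysis for this report:\n"
                    + json.dumps(report_table)},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content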
If you read my last post on SQL generation, it will probably come as no surprise that our initial Highlight requests didn’t work perfectly. Here’s an example of what ChatGPT generated out of the box when we asked it for highlights and analysis of a typical pull:
The provided JSON structure is an array of objects, each representing a location with a name and a corresponding number of “total_draws”. Here are some insights that can be derived from this data:
1. Total Number of Draws: The sum of all “total_draws” across all locations can be calculated to understand the total volume of draws.
2. Highest and Lowest Draws: Identifying which location has the highest number of draws and which has the lowest can provide insights into the most and least frequented or utilized locations.
3. Comparison by Location: By comparing the “total_draws” of locations with similar names we can infer which areas of the same location are more popular or active.
4. Interior vs. Site: We can compare the total number of draws for “Interior” locations versus “Site” locations to see if there is a trend in terms of which type of location has more activity.
5. Average Draws: Calculating the average number of draws across all locations can provide a baseline for what is typical within this dataset.
6. Distribution of Draws: Analyzing the distribution can help to understand if the draws are evenly spread across locations or if there are outliers with significantly higher or lower numbers.
Here’s a brief analysis based on the provided data:
– Total Number of Draws: The sum of all “total_draws” is 361,771.
– Highest Draws: “[Location Name]” has the highest number of draws with 65,714.
– Lowest Draws: “[Location Name]” has the lowest number of draws with 10,413.
– Interior vs. Site: If we sum the “total_draws” for all “Interior” locations and compare it to the sum for all “Site” locations, we can see which type of location has more draws.
– Average Draws: The average number of draws across these 10 locations is 36,177.1.
To provide a more detailed analysis, one could perform statistical calculations or visualize the data using charts or graphs. For example, a bar chart could help to visually compare the “total_draws” across different locations.
That’s pretty awful. Whatever time you spent reading those highlights is time lost from your life with absolutely no return.
As with SQL generation, it was clear that to get ChatGPT to do better, we needed to provide it context. In some respects, context here was simpler than with SQL. We didn’t need to worry about table structures or query performance. The data is always restricted to what got pulled in the interface. There aren’t any performance implications. And the structure is always simple and pre-defined.
Metric Definition and Explanation
But ChatGPT was still struggling to say anything interesting and the first and most obvious problem was that it didn’t necessarily understand our metrics. We already had a help file that documents every metric in the system, so we lifted that and incorporated it into the context. Here’s a sample:
Dwells Unique A unique count of when a customer spent consecutive time in an area exceeding the minimum threshold set for a Dwell (by default 60 seconds but customizable in the Admin>Settings interface). A customer can have no more than one Dwells Unique per area per visit. If a customer enters an area, spends enough time there, leaves, re-enters the area and again spends sufficient time there, only one Dwells Unique will be counted for that area.
Instead of placing the entire file into the context, we evaluate which metrics are in the report request and dynamically provide only those definitions as context to ChatGPT. While this helped, we found that these technical descriptions weren’t great at providing the color around a metric that ChatGPT needs to really shine. So, we added more commentary about each metric. Here’s a sample of how we tell ChatGPT about “visits”:
Visits is a count of people. Each person that comes to the location is a visit. The higher the visit count, the more people visited. In general, increased visit counts are good and decreases are bad. If visit counts are trending in one direction, it’s always interesting to see if other measures – such as dwells, engagements, or conversion are trending in the same direction. If not, then it usually means that the location is experiencing changes in the type of people coming to the location.
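Before getting to what that commentary buys us, a quick note on mechanics: splicing the right definitions into the context is just a lookup keyed by the metrics in the report. A simplified sketch follows; the metric keys and text are abbreviated placeholders.

# One entry per metric: help-file definition plus analyst commentary.
METRIC_CONTEXT = {
    "visits": "Visits is a count of people. Each person that comes to the "
              "location is a visit. In general, increases are good...",
    "draws": "A draw is the first place that people dwelled in an area or "
             "section. The more draws a section had...",
    "dwells_unique": "A unique count of when a customer spent consecutive "
                     "time in an area exceeding the Dwell threshold...",
}

def metric_context_for(report_metrics):
    # Only the metrics actually selected in the report go into the prompt.
    parts = [METRIC_CONTEXT[m] for m in report_metrics if m in METRIC_CONTEXT]
    return "\n\n".join(parts)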
Analytic Guidance
This description provides additional vocabulary, and it tells ChatGPT the meaning of directional changes. This really helps crisp up the highlights. In addition, it suggests other metrics to ChatGPT that can place useful context around visits. This helps ChatGPT generate suggestions for additional queries.
Here’s another example from the context explaining the “Draws” metric:
A draw is the first place that people dwelled (spent time or lingered) in an area or section. Draws are measured by Section or area. The more draws a section had, the more likely it is to be the reason people came to a store. A draw is also a measure of the importance of the section to the store. Finally, the ratio of Draws to Visits for a section is interesting – this is the draw rate. If the ratio of Draws to Visits is high, then the section is mostly being used by people who go directly there. If it’s low, then shoppers tend to find their way to the section as they shop. In general, the higher the percentage of draws a section has, the more important it is to the store.
Defining draws is important because the underlying behavior isn’t necessarily obvious from the metric name (which appears in the column header). It’s obvious from the ChatGPT extract above that it had no idea what the metric meant. Again, though, the analytic instructions here really help ChatGPT understand how to think about this metric. When I look at store metrics, I’m using Draws to understand what brought people to a store – and that’s the way I want ChatGPT to use it when it generates highlights.
We also provided context around the different forms most of our people-measurement variables could take. We tell ChatGPT about the difference between the count, the rate, and the share metrics. This helps ChatGPT attach meaning to metrics and trends. This piece of context really helped sharpen the highlights provided back:
In general, the basic count metric is best used for understanding trends in usage. The Rate metric is best used for understanding trends in performance. And the share metric is used to understand the relative importance or performance from one area or section to another.
Using this, ChatGPT can better explain to a user what a report means AND suggest additional metrics to add to the report that would allow for additional conclusions and highlights.
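Putting it together, the context we send is these pieces stitched into one system prompt: the relevant metric definitions, the count/rate/share guidance, and (as described in the next sections) trend or focus instructions. A simplified sketch of the general shape; the wording and section order are illustrative.

COUNT_RATE_SHARE_GUIDANCE = (
    "The basic count metric is best for understanding trends in usage. "
    "The rate metric is best for understanding trends in performance. "
    "The share metric is for comparing the relative importance of one "
    "area or section to another."
)

def build_context(metric_definitions, trend_guidance="", focus_guidance=""):
    # Each argument is a block of text; empty blocks are simply skipped.
    sections = [
        "You are generating highlights for a people-measurement report.",
        metric_definitions,          # output of metric_context_for() above
        COUNT_RATE_SHARE_GUIDANCE,   # how to read count vs. rate vs. share
        trend_guidance,              # added when the report is trended
        focus_guidance,              # added when it isn't (see Focusing below)
    ]
    return "\n\n".join(s for s in sections if s)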
Trending & Time Granularities
Another area where we provide a lot of analytical guidance to ChatGPT was around trending. Anyone who’s used ChatGPT knows that it is often a Master of the Obvious. It will pick out the overall trend when it’s obvious and it can call out the highest and lowest periods or locations in a report. Well…that’s not nothing. There are professional sports commentators who don’t do much better. But it’s a waste of ChatGPT’s capabilities, because while those facts might save someone a smidge of time (or not), they are things EVERYONE knows how to do.
Our goal is to get ChatGPT to help people think well about their data. And that means we need to tell ChatGPT a little more about how to do that.
In our data, good thinking about trends begins with what’s being trended – typically either locations or areas inside a single location. When you trend locations, you’re usually interested in which locations are trending up or down and which locations have changed the most. Locations are comparable. When you’re trending areas inside a location, it’s not necessarily the same thing. It may be interesting to know that baseball gear is trending up and football gear is trending down, but that’s not necessarily a measure of performance so much as a shift in visitor interest.
In our context, we tell ChatGPT how to think about each of the key dimensions when it’s trended.
But there’s another aspect of trending that’s easy to overlook. All trends pick some time granularity to group data. In our platform, you can go as low as 10 minutes and as high as a year. A trend over 7 days is a very different beast than a trend over 7 years. We want ChatGPT to be sensitive to the time granularity and to make sure its recommendations fit appropriately.
For example, if you trend a single day by 10 minutes, and see that the number of visits is going up toward the end of the day, it’s almost always wrong to suggest that visits are increasing. But if you trend 2 years by month, such a conclusion might be sound.
In our context, we tell ChatGPT to use the time granularity to suggest different kinds of highlights. If the date range is limited and the time granularity is small, we want ChatGPT to focus on which parts of the day were slow or busy, or how engagement changed by daypart.
We want to keep that focus for any report where the time granularity is less than daily.
As time granularity goes up, however, the types of highlights that are interesting change. At the daily level, we may be interested in Day of Week trends. Our data tends to be highly sensitive to the day of week (all retail data is, but a lot of location data outside retail tends to be as well), so telling ChatGPT to focus on that when a report is run at the daily level usually provides better highlights. Each time granularity comes with a set of expectations the analyst brings to the table about what it’s good for and how it can be used. That’s the kind of thinking ChatGPT doesn’t do naturally, but it will follow your lead if you provide the context.
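In practice, this amounts to mapping the report’s time grain to a different block of instructions. A sketch of the idea; the thresholds and wording are illustrative, not our exact prompts.

def trend_guidance(granularity_minutes):
    # Sub-daily grains: talk about busy and slow dayparts, not long-term trends.
    if granularity_minutes < 24 * 60:
        return ("Focus on which parts of the day were slow or busy and how "
                "engagement changed by daypart. Do not treat within-day "
                "patterns as evidence of overall growth or decline.")
    # Daily grain: day-of-week effects dominate this kind of location data.
    if granularity_minutes == 24 * 60:
        return ("Focus on day-of-week patterns; this data is highly "
                "sensitive to the day of week.")
    # Weekly, monthly, and yearly grains: genuine direction and biggest movers.
    return ("Focus on the overall direction of the trend and on which "
            "locations or sections changed the most over the period.")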
Focusing
The last thing we do in the ChatGPT integration with our table report generator is provide what we call focusing. When you ask ChatGPT for highlights or insights, you’re going to get a decidedly mixed bag. In our SQL generation, the user is asking the question, so we don’t need to think about focus. But in this case, the user question is implicit in the data pull. In many respects, the Trending context that we provide ChatGPT is instruction on how to get at the implicit meaning of the table request. Trending requests turn out to be easier than non-trending requests in this regard because they are necessarily focused on some kind of change over time, and the time granularity usually controls what that is. But what’s a good focusing question if the user pulls a non-trended report of the visits and dwell rate for areas of a location?
In our context, we need to help turn that request into a question – and the right question is very dependent on what data was selected for the report. ChatGPT will often do a surprisingly good job of this just based on the information we’ve given it about metrics. But without focusing, it often spews out a lot of conclusions – some of which are interesting and many of which aren’t.
Here’s an example of a pretty good inference that ChatGPT generated:
– The Packaged Beverages section has a high Draw Rate of 15.33%, which is impressive given its lower visit count compared to some other sections. This suggests that while fewer people visit this section, those who do are likely there with the intent to purchase.
That’s just from the information we provided about Draw Rate and how to think about it. Compare that to the initial pull above and you can see how context has transformed ChatGPT highlighting from completely useless to pretty decent.
There were still some pretty useless ones like this, though:
1. Entrance: – Highest number of visits (3,433), but very low average time spent (0.05 min), indicating that while it’s the most trafficked area, people pass through quickly, as expected for an entrance.
Unfortunately, without focusing, the useless highlights tend to outnumber the useful ones about 3-to-1. To focus ChatGPT with a better set of questions than just “tell me what’s interesting,” we tell it what to focus on when a non-trended report includes certain dimensions or variables. This doesn’t always work because we can’t be comprehensive, but we can help ChatGPT in a lot of cases.
Think about it this way. If a user has selected two variables in a report, then what they’re probably looking for is cases where one is high or low relative to the other. When a report has more variables, it’s often the same story, but there’s typically one or two variables that the user is looking to relate to changes in the others.
To help ChatGPT figure out how to parse that, we provide a prioritization of variables. For example, if a report includes conversion metrics, we encourage ChatGPT to focus on that – since that also tends to be where analysts will go first. If a report includes aspects of a conversion funnel, we’ll ask ChatGPT to focus on the funnel parts that are there. If a report includes engagement and visits data, we’ll ask ChatGPT to focus on engagement relative to usage.
Finally, we like to give ChatGPT some pointers about what’s not interesting. In our data, the Entrance and Walkways tend not to be interesting in most locations. So we encourage ChatGPT not to focus on them. We also tend to discourage it from highlighting engagement or journey metrics at the Point of Sale. If you checked out, you spent a chunk of time at the PoS, but it doesn’t mean anything.
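Here’s a sketch of how that focusing logic can be expressed. The priority order and the “don’t bother” list mirror the examples above, but the names and structure are illustrative.

# Highest-priority metric family wins: conversion, then funnel steps,
# then engagement relative to usage.
FOCUS_PRIORITY = ["conversion", "funnel", "engagement"]

# Sections we tell ChatGPT not to dwell on.
DOWNPLAY_SECTIONS = {"Entrance", "Walkways", "Point of Sale"}

def focus_guidance(report_metric_families, report_sections):
    # Build focusing instructions for a non-trended report.
    lines = []
    for family in FOCUS_PRIORITY:
        if family in report_metric_families:
            lines.append(f"Focus primarily on {family} and how the other "
                         "metrics in the report relate to it.")
            break
    boring = DOWNPLAY_SECTIONS & set(report_sections)
    if boring:
        lines.append("Avoid highlights about: " + ", ".join(sorted(boring)) + ".")
    return "\n".join(lines)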
Admittedly, this kind of context building can’t be done in a truly general-purpose BI tool. But if your BI system is internal or vertically focused (like ours), then you can make a ChatGPT-driven highlight generator a lot less of a MOTO.
Push vs. Pull
I mentioned that when we first implemented the highlighting feature, we debated push vs. pull. If we sent every request to ChatGPT, we could make the highlights a constant companion to every report pull. We could even use them to build up a kind of conversation. This seemed ideal, but we wanted to test how the system worked before committing to using it constantly. So, we decided to start with a pull model. In the pull model, the user must request insights on a table they’ve already generated.
After implementing the feature, we found that limitations in the ChatGPT API made it necessary to stick with the pull model. The API is often very slow to return. Most data pulls in our system are nearly instantaneous; it’s a rare request that takes more than 2 seconds. Requests to the ChatGPT API, however, usually take much longer than that, often 10-20 seconds. Brutal. That’s too long to wait for a simple report. And, of course, we have to pay every time we hit the ChatGPT interface. Individual questions are very inexpensive, but our questions tend to be context-heavy (we’re passing a whole table of data each time), and our users quickly generate a lot of table requests when they’re in the workbench. It’s the kind of tool you just slam around in. The real killer wasn’t economics, it was query performance, but I won’t deny that cost matters too.
We haven’t given up hope of incorporating highlights into every request. For our on-premises clients, cost wouldn’t be an issue, and we’re also testing approaches that use Llama both to improve return performance and to eliminate most of the per-query cost.
Learnings
#1. Context is King: As with our SQL Generator, the key to getting good ChatGPT results was thoughtful use of the context. We pass the generated data table along with a dynamic, programmatically generated context to get decent highlights. If you’re building an internal or vertically focused ChatGPT integration, you’ll have to do the same to get decent results.
#2. Metrics Definition: ChatGPT will understand some of your metrics and dimensions but not others. You need to explain what’s what.
#3. Tie Direction to Performance: Okay, this one kills me because I always tell people that KPIs should never be simply interpreted by direction. And you know what – everyone does it anyway. So, if ChatGPT is going to be useful, it needs to understand how people think about the data and whether, for example, more visits is good or bad (good for a store, bad for the customer support area). Highlights are better when ChatGPT understands how the user will probably think about whether changes are good or bad.
#4. Focus: Telling ChatGPT about date granularities is important for getting its highlights to feel appropriate to the way people think about the data. It’s also pretty easy. Focusing ChatGPT when data isn’t trended is almost an AI problem unto itself. Our best efforts so far use a prioritization of variables and some simple instructions about how to handle reports with 1, 2, 3, or more variables.
#5. Performance & Cost: In one sense, performance isn’t terrible. Waiting 10 seconds to get useful highlights is kind of reasonable. But it’s too long to bundle into the query return and we could see that cost was going to become an issue for us in our SaaS cloud-based model. If you’re thinking about ChatGPT integration into your internal BI systems, these are issues that are going to matter. And not to go all foreshadowy, but cost became an even bigger issue when we started to work on our last ChatGPT step – creating a generalized ChatGPT interface to the data.
#6. It’s workable: We felt okay about the final product. It isn’t world-shaking, but it’s much better than we could do on our own and it sometimes provides legitimately useful callouts. If you’ve used as many disappointing highlight systems as I have, that’s not too bad.