Ferruccio Balestreri, CTO and Co-Founder at June

24 Jul 23

What every SaaS company adding AI to their product should know

It's 2023, and unless you're living under a rock, you're discussing how to add AI features to your product.

I don't know how well this will age, but 74% of startups are launching an AI feature.

Every startup people look up to, from Notion to Figma, is adding AI-powered features to its product.

SaaS products launching AI features - Graphic from: Jake Saper

We've built and deployed multiple AI features in production, so today we're sharing everything we learned - the good and the ugly.


Table of contents

  • Our journey
  • Lessons learned

Our journey

For context: we're June, a product analytics startup, so we won't be talking about AI from a research perspective, but as a product company.

Our core business is not AI, but we want to be at the cutting edge of building the best possible user experience with whatever technologies are available at the moment - including LLMs.

To date we've launched a couple of AI-powered features in our product.

How we got started


After seeing Natbot, I tried playing around with it to see if I could use it to automate some manual tasks within June:

Every time we got a new sign-up, we used to google the user's email and try to understand both the size of their company and their role.


Over a couple of weekends I put this together as a feature integrated with our core product, allowing any of our users to qualify their sign-ups the same way we do.

We iterated a lot on the prompt for this, and this is what we used to launch:

The reception on Twitter was great, and we released this to all of our customers.

This is now one of the first things we get users to implement after they sign up, as it makes implementing June valuable from day 1.


After launching this, though, we realised there weren't any other valuable use cases we could unlock for our customers using GPT-3.

From our experiments, GPT was great at extracting important information from unstructured data - for instance, a person's title from their LinkedIn profile.
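To make that concrete, here's a rough sketch of what such an extraction call can look like, using the 2023-era openai Python library. The prompt, the `extract_person_info` helper, and the example profile text are illustrative, not the actual prompt we shipped:

```
import json

import openai  # assumes the 2023-era SDK and OPENAI_API_KEY set in the environment

def extract_person_info(profile_text: str) -> dict:
    """Pull a job title and company size out of unstructured profile text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep extraction deterministic
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract the person's job title and their company's size from the text. "
                    'Respond only with JSON: {"title": string, "company_size": string}.'
                ),
            },
            {"role": "user", "content": profile_text},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(extract_person_info("Jane Doe - VP of Engineering at Acme Corp (201-500 employees)"))
```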


For more complex tasks, like identifying the most promising sign-ups of each week or generating SQL queries, the results weren't any good. So after launching our enrichment and qualification feature, we went back to working on our core product.


As soon as the GPT-4 API was released, we started experimenting with it.


Our goal this time, though, was to make sure we built something that solved a real problem for our customers and wasn't just a gimmick.

The main challenge with our product was that a lot of free users weren't converting to paid plans because of its lack of flexibility.


So, as we were trying to figure out solutions to this problem, we asked ourselves:

"What's the most flexible thing we could give to our users?"
"What if we gave our users a SQL editor with an AI assistant - so that if they have some specific questions our product doesn't cover they can unblock themselves? Is GPT-4 smart enough to do this?"

To be honest, at the start I was a bit skeptical that this was a viable solution. But Adis, one of our founding engineers, had strong opinions on this and decided to prototype it himself.

He started trying to get GPT-4 to generate SQL queries and the results were surprisingly good.


Within a few days he went from idea to a working prototype that allowed us to add custom charts to any report in our product. The cool thing about this was that even our less technical teammates were now able to create custom charts for some of our customers.

At this point the results were so promising that we decided to shift all of our engineering resources to polishing the experience and releasing it to some customers.


In 10 days we managed to launch June AI in Beta.

The reception was great, and we got a lot of people to sign up for the beta.

So far, though, retention has been much lower than for our core product. Looking at our feature report, it's roughly between 20-30% after 8 weeks.

The retention of our AI insights
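For clarity, "retention after 8 weeks" here means the share of users who are still using the feature in their eighth week after first touching it. A back-of-the-envelope sketch of that calculation, assuming a hypothetical list of (user_id, timestamp) usage events:

```
from datetime import timedelta

def week_n_retention(usage_events, n=8):
    """Share of users still active n weeks after their first use of a feature.

    usage_events: list of (user_id, timestamp) tuples for a single feature.
    """
    # Find each user's first use of the feature.
    first_use = {}
    for user_id, ts in usage_events:
        if user_id not in first_use or ts < first_use[user_id]:
            first_use[user_id] = ts

    # Count users with any activity in week n after their first use.
    retained = 0
    for user_id, start in first_use.items():
        week_start = start + timedelta(weeks=n)
        week_end = start + timedelta(weeks=n + 1)
        if any(u == user_id and week_start <= ts < week_end for u, ts in usage_events):
            retained += 1

    return retained / len(first_use) if first_use else 0.0
```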

Why is retention so low?

After drilling into the numbers, what we learned is that the users who keep using this feature over time are the ones who are able to produce something they save in their first session.

Everyone else loses patience and trust if they don't get what they want in their first session.

By looking at the prompts people try - we also noticed something very unexpected.

Non-technical people have a much harder time prompting.

Our assumption here is that developers are already used to working with an AI assistant like Copilot or ChatGPT to solve problems collaboratively, so they have stronger prompting skills.

We're still happy we launched this feature early though - and as a team we built conviction around three things:

  1. With time we can build better first-time experiences - for example, make the LLM ask clarifying questions (see the sketch below)
  2. The more queries we generate, the more we can increase accuracy
  3. In the next few years, general-purpose models will keep getting better and better at writing SQL
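On the first point, here's a sketch of the direction we mean: let the model respond with either a query or a question, and branch on which one comes back. The JSON protocol and function names below are illustrative, not our implementation:

```
import json

import openai

CLARIFY_SYSTEM_PROMPT = (
    "Convert the user's question into a ClickHouse SQL query. "
    "If the question is ambiguous, do NOT guess: respond with "
    '{"clarifying_question": "..."} instead of {"sql": "..."}.'
)

def answer_or_clarify(conversation):
    """Returns either {'sql': ...} or {'clarifying_question': ...}."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "system", "content": CLARIFY_SYSTEM_PROMPT}] + conversation,
    )
    return json.loads(response.choices[0].message.content)

result = answer_or_clarify([{"role": "user", "content": "How many active companies?"}])
if "clarifying_question" in result:
    print(result["clarifying_question"])  # e.g. "What does 'active' mean to you?"
else:
    print(result["sql"])
```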

Measuring and improving performance

The first version of our AI insights was very naive and used this prompt:

Make sure that SQL queries are correct and performant to avoid user error messages. Here's the data structure to work with:

          Table 1: all_user_traits (User Table) - this is a materialized view for every time a user gets identified with traits.

          | Field        | Type                   |
          |--------------|------------------------|
          | user_id      | String                 |
          | anonymous_id | String                 |
          | key          | LowCardinality(String) |
          | value        | String                 |
          | timestamp    | DateTime('UTC')        |

          Available keys for 'all_user_traits': #{user_traits}

          Table 2: group_traits (Company Table) - this is a materialized view for every time a group gets identified with traits.

          | Field     | Type                   |
          |-----------|------------------------|
          | group_id  | String                 |
          | user_id   | String                 |
          | key       | LowCardinality(String) |
          | value     | String                 |
          | timestamp | DateTime('UTC')        |

          Available keys for 'group_traits': #{group_traits}

          Note: Synonyms for 'company' are: account, workspace, group. Synonyms for 'user' are: member, seat.
          Note: all_user_traits and group_traits tables can contain multiple values for the same key, but we need to pick only the latest value based on timestamp. Do that with argMax function from clickhouse by grouping on user_id and group_id respectively for both those tables.

          Example query:

          ```
          WITH
            argMax(value, timestamp) AS value
          SELECT
            user_id,
            key,
            value
          FROM all_user_traits
          WHERE notEmpty(user_id)
            AND is_active
          GROUP BY
            user_id,
            key
          ```

          Note: Applying specific filters on user or group traits should be done only after the argMax function is used to pick the latest value for each key as shown in the last example.

          Example query:

          - A query to find all users where email does not contain june.so would look like this:
          ```
          SELECT
            user_id
          FROM (
            WITH
              argMax(value, timestamp) AS value
            SELECT
              user_id,
              key,
              value
            FROM all_user_traits
            WHERE notEmpty(user_id)
              AND is_active
            GROUP BY
              user_id,
              key
          )
          GROUP BY
            user_id
          HAVING NOT groupBitOr(notEmpty(user_id) AND key = 'email' AND position('june.so' IN value) != 0)
          ```
          - A query to find all paying companies, i.e. a key called 'plan' with value of 'Paid' would look like this:
          ```
          SELECT
            group_id
          FROM (
            WITH
              argMax(value, timestamp) AS value
            SELECT
              group_id,
              key,
              value
            FROM group_traits
            WHERE notEmpty(group_id)
              AND is_active
            GROUP BY
              group_id,
              key
          )
          GROUP BY
            group_id
          HAVING groupBitOr(notEmpty(group_id) AND (key, value) = ('plan', 'Paid'))
          ```

          Table 3: events (Events Table)

          | Field             | Type                          |
          |-------------------|-------------------------------|
          | type              | UInt8                         |
          | name              | String                        |
          | user_id           | String                        |
          | group_id          | String                        |
          | anonymous_id      | String                        |
          | timestamp         | DateTime64(6, 'UTC')          |
          | context           | String                        |
          | properties        | String                        |
          | properties_keys   | Array(LowCardinality(String)) |
          | properties_values | Array(LowCardinality(String)) |
          | source_name       | String                        |

          For the group_traits and all_user_traits tables, the key column represents different traits or attributes of a group or user respectively. This column is used for filtering based on these traits or attributes.
          For the events table, the name column represents different events. This column is used for filtering based on these events.
          When generating queries, assume that a value listed in the key column of group_traits or all_user_traits tables cannot be used as an event name in the events table.
          If a specific key or event name is to be used in a query, it will be clearly mentioned in the prompt. If not, make a reasonable assumption based on the context and the information available in the tables.

          DateTime Arithmetic: When performing arithmetic operations (like subtraction) between DateTime fields in ClickHouse, it is necessary to convert them into Unix timestamps using the toUnixTimestamp() function. This is because direct operations on DateTime fields are not supported. The resulting difference will be in seconds.
          Ensuring Same DateTime Formats: When comparing or operating on DateTime fields, make sure they are in the same format. If there are differences (like DateTime and DateTime64), use the appropriate conversion functions to ensure the fields are in the same format before proceeding with operations.

          Additional important rules:

          - NEVER use semicolons, you're writing Clickhouse queries.
          - NEVER use LIMIT. We're paginating results already.
          - In cases where data required to answer a question is spread across multiple tables, JOIN operations may be necessary. You should determine the common columns between these tables, which could be used to join them. For instance, the user_id column is common between all_user_traits and group_traits tables, and can be used for joining them to correlate user-related and company-related data.
          - NEVER do a join between events and all_user_traits tables.
          - NEVER do a join between events and group_traits tables.
          - The events table should only be filtered on user_id or the group_id columns and never joined with the all_user_traits or group_traits tables because of performance reasons.

          - A query to find the count of events performed per day by all paying companies in the last week
          ```
          SELECT
            toStartOfDay(timestamp) as day,
            COUNT() AS count
          FROM events
          WHERE timestamp >= now() - INTERVAL 1 WEEK
            AND group_id IN (
              SELECT
                group_id
              FROM (
                WITH
                  argMax(value, timestamp) AS value
                SELECT
                  group_id,
                  key,
                  value
                FROM group_traits
                WHERE notEmpty(group_id)
                  AND is_active
                GROUP BY
                  group_id,
                  key
              )
              GROUP BY
                group_id
              HAVING groupBitOr(notEmpty(group_id) AND (key, value) = ('plan', 'Paid'))
            )
          ```

          - When querying for relations between users and companies, consider the relationships implied by the schema. If a 'plan' value of 'Paid' in the all_user_traits table indicates a paying user, and you need to identify the companies those users belong to, you need to join this table with the group_traits table on the user_id field. This enables you to access the group_id field representing the companies.

          Example query:

          ```
          SELECT DISTINCT
            group_id,
            COUNT(DISTINCT user_id) AS paying_users
          FROM (
            SELECT
              user_id
            FROM (
              WITH
                argMax(value, timestamp) AS value
              SELECT
                user_id,
                key,
                value
              FROM all_user_traits
              WHERE notEmpty(user_id)
                AND is_active
              GROUP BY
                user_id,
                key
            )
            GROUP BY
              user_id
              HAVING groupBitOr(notEmpty(user_id) AND (key, value) = ('plan', 'Paid'))
          ) user_filters
          LEFT JOIN (
            SELECT DISTINCT
              group_id,
              user_id
            FROM group_traits
            WHERE notEmpty(group_id)
              AND is_active
          ) groups
          ON user_filters.user_id = groups.user_id
          GROUP BY group_id
          ```

          Top #{EVENT_LIMIT} most triggered event names for querying the 'name' column in the 'events' table: #{event_names}

          #{report_context? ? report_setup_prompt : ''}

          Note: If you do not see an event that resembles a sign up event from the list, use `MIN(timestamp) FROM events` as this will represent the first event the user triggered, which is a good default when there is no sign up event.
          Note: If you see 'First event', use `MIN(timestamp) FROM events` because it represents the first event the user triggered.
          Note: If you see 'Any event', do not use any event filter in the query because that'll make the query look for all events.
          Note: There are two types of events - a track event (type = 2) and page event (type = 0)

          - For accessing specific properties from the events table, remember that the properties_keys field is an array containing keys, and properties_values is an array containing corresponding values. Use indexOf to find the position of a specific key in properties_keys, and use this position to access the corresponding value in properties_values.
          - When querying properties within arrays, use the arrayFirstIndex() function to find the index of a specific property in the properties_keys array.
          - Then, use the obtained index to fetch the corresponding value from the properties_values array.
          - To check the presence of a property in the properties_keys array, utilize the arrayExists() function.

          Example query:

          ```
          SELECT
            trim(BOTH ' ' from value) as trimmed_value,
            COUNT(value) as count
          FROM
          (
            SELECT
              arrayFirstIndex(x -> x = 'device''s type', properties_keys) as index,
              properties_values[index] as value
            FROM events
            WHERE type = 2
              AND name = 'retention_event'
              AND arrayExists(x -> x = 'device''s type', properties_keys) = 1
              AND is_active
          )
          GROUP BY trimmed_value
          ORDER BY count DESC
          ```

          Here is your conversation so far with the user: "#{transcript}", in case it's useful. "#{error}"

          The most important rule is respond only with SQL query, without any other natural language instructions. If you can't convert the prompt into a SQL query, return the error message: 'Unable to generate SQL query.'. Never respond with any other message.

Once we started iterating on the prompt and shipping improvements, we built some systems to measure accuracy.

We built a small admin panel to be able to switch between different prompt templates and models and test how accurate they are.

An example of prompt metrics from a prompt version that we were testing
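The panel itself is internal, but the idea behind it is simple: run a fixed test set of questions through each prompt template and model combination, execute the generated SQL, and score the results. A simplified sketch, where `generate_sql` and `run_query` are hypothetical stand-ins for the real plumbing:

```
from itertools import product

# Hypothetical test cases: a natural-language question plus the rows we expect back.
# A `None` expectation means we only check that the generated SQL executes.
TEST_CASES = [
    ("How many paying companies do we have?", [(42,)]),
    ("Events per day over the last week", None),
]

PROMPT_TEMPLATES = {"v1": "...", "v2": "..."}  # the templates under test
MODELS = ["gpt-3.5-turbo", "gpt-4"]

def evaluate(generate_sql, run_query):
    """generate_sql(template, model, question) -> str; run_query(sql) -> rows."""
    scores = {}
    for template_name, model in product(PROMPT_TEMPLATES, MODELS):
        correct = 0
        for question, expected in TEST_CASES:
            try:
                rows = run_query(generate_sql(PROMPT_TEMPLATES[template_name], model, question))
                if expected is None or rows == expected:
                    correct += 1
            except Exception:
                pass  # queries that error count as incorrect
        scores[(template_name, model)] = correct / len(TEST_CASES)
    return scores
```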

What we're going to do next is ingest tens of thousands of correct queries as embeddings.

Once we have those embeddings, we can search for similar questions customers have asked and dynamically include 1-2 similar queries in the prompts we use to generate SQL queries.


This should allow us to progressively increase accuracy on any query after it has been run once.
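A minimal sketch of that retrieval step, assuming OpenAI's text-embedding-ada-002 model and an in-memory example store (in practice this would be a vector database):

```
import numpy as np
import openai

def embed(text):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

# Hypothetical store of past questions paired with known-good SQL.
EXAMPLES = [
    ("How many paying companies do we have?", "SELECT count(DISTINCT group_id) ..."),
    ("Events per day over the last week", "SELECT toStartOfDay(timestamp) ..."),
]
EXAMPLE_VECTORS = np.array([embed(question) for question, _ in EXAMPLES])

def similar_examples(question, k=2):
    """Return the k most similar past (question, SQL) pairs by cosine similarity."""
    v = embed(question)
    sims = EXAMPLE_VECTORS @ v / (np.linalg.norm(EXAMPLE_VECTORS, axis=1) * np.linalg.norm(v))
    return [EXAMPLES[i] for i in np.argsort(sims)[-k:][::-1]]

# The retrieved pairs then get appended to the SQL-generation prompt as worked examples.
```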

We're also benchmarking this approach against fine-tuning.

Something we learned about recently is that there are now products that help you manage internal training processes; the one we're trying is SuperAnnotate.

As our LLM fine-tuning pipelines become more complex, we've found it pretty useful. It lets you build, with a couple of lines of Python, UIs for rating model outputs.

They have some open-source templates that make it easy to get started.

Lessons learned

Here are some of the most important lessons we learned so far.

Bet on general intelligence over specialised intelligence

If you're trying to solve a specific problem like "generate a SQL query from an English sentence", general approaches with models that use more computation beat specialised custom models.

It's always preferable to build things assuming that:

  1. We will get more intelligent models
  2. The price to run these models will go down over time

If you could pick between SQL generated by GPT-4 with a static prompt and SQL generated by a fine-tuned GPT-3 with a similar working query attached as an example, you'd prefer GPT-4.

What this means is that we try not to invest in marginal accuracy improvements: in the long run, smarter models will give you those gains out of the box.

If something takes more than 20 seconds to do, AI can help

If you have some complex setup that takes more than 20 seconds to complete, you can consider adding AI to make it better.

Consider that it takes about 5-6 seconds to get a completion for a task, plus some time to type in what you want. So for now, the minimum time a task needs to take for AI to make it better is about 20 seconds.

Anything that normally takes less than 20 seconds won't get done faster with AI.

A good example of this is setting an alarm with Siri. Saying "Set an alarm for 7am tomorrow morning" saves you 30 seconds of finding the right app and filling in information.

An example where it makes little sense to improve a flow with AI is something very easy to access. For instance, from your phone's lock screen you'd never say something like "Hey Siri, open the camera app". It's faster to just do it yourself.

Keep a human in the loop

Our first attempt at building AI assisted SQL generation

Every popular AI app keeps the human in control and acts as an assistant. Don't try to build anything too smart; whatever you build, give your users the power to accept, retry, or reject proposals from your chosen AI model.

With the first iteration of our AI-powered SQL editor we didn't do this enough, and we saw it in our metrics: the rate of correct completions wasn't nearly as high as with our new conversational AI.

Our v2 is a conversational interface

Human feedback is still required to steer LLMs in the direction we want them to go.
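In code, "accept, retry or reject" boils down to a loop like the one below. This is a sketch of the pattern rather than our actual implementation, with a console `input()` standing in for the UI:

```
import openai

def propose_sql_with_feedback(question):
    """The model proposes a query; the human accepts, retries with feedback, or rejects."""
    messages = [
        {"role": "system", "content": "Convert the user's question into a ClickHouse SQL query."},
        {"role": "user", "content": question},
    ]
    while True:
        response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        sql = response.choices[0].message.content
        print(sql)
        choice = input("[a]ccept / [r]etry with feedback / re[j]ect? ")
        if choice == "a":
            return sql
        if choice == "j":
            return None
        # Retry: keep the rejected proposal in context and add the human's correction.
        messages.append({"role": "assistant", "content": sql})
        messages.append({"role": "user", "content": input("What should change? ")})
```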

Work on retention

Retention is the measure of ongoing value. If you're building a SaaS company you want to keep delivering value over time.

The fundamentals of building something people want haven't changed with AI. If you can build something - with AI or without - that becomes part of people's lives, you've built something of value.

If you want to learn more about how to think about retention for your product we wrote a comprehensive playbook for early stage startups.

Want to help us figure out what's next?

We're hiring product engineers to join us and help us design and build new ways of interacting with data.

If what we're working on sounds interesting or resonates, email us at work@june.so.

Let me know if you have any feedback, ideas to share, or questions by DMing me on Twitter or emailing me at ferruccio@june.so.
