Many organizations are inundated with vast amounts of unstructured text data, such as user feedback, social media interactions, customer service call logs, and much more. The volume of this information can be overwhelming, making it challenging to extract actionable insights.
Pinecone and Canopy provide a powerful set of tooling to help digest and analyze this unstructured text, allowing businesses to quickly and easily transform their rich datasets into business value.
In this blog, we’ll explore the practical applications of Pinecone and Canopy to unearth valuable insights from user reviews, revealing patterns and sentiments that can drive strategic decisions.
Why Use Pinecone to Understand User Feedback?
Developing apps is a demanding endeavor that often requires resilience, particularly when those apps are made available on public platforms like Google Play and the Apple App Store. Users, shielded by a thin layer of anonymity, freely express their most extreme opinions—both positive and negative, as well as the occasional quirky remark.
Some of these reviews are merely glowing, critical, or humorous, but others offer invaluable insights for improvement. These might include performance optimizations, user interface enhancements, and bug fixes. However, the challenge lies in distilling this vast sea of feedback into actionable projects that can enhance the app. Although mining for specific topics or performing sentiment analysis can provide some direction, efficiently summarizing this data to uncover common trends and crucial features remains a difficult task.
Enter Pinecone’s vector database.
By tokenizing and storing reviews in a vector database, we can leverage retrieval-augmented generation (RAG) to pinpoint relevant reviews. Additionally, with the intelligence of a large language model (LLM) like ChatGPT, we can transform this extensive collection of reviews into a concise set of feature enhancement projects. These projects will consist of actionable tasks aimed at improving our app, all grounded in the wealth of knowledge embedded in user feedback.
This approach not only saves time but also ensures that our app evolves in alignment with user needs and preferences.
For this blog, we are excited to share a RAG solution using Canopy, the new RAG framework powered by Pinecone, one of the most widely used and well-regarded vector databases in existence. The remainder of the blog will guide you through the process and explain the benefits of using Canopy to meet your RAG chatbot needs.
By the end of this post, you will be able to analyze recent reviews and get actionable app improvement projects and task plans while utilizing Pinecone, Canopy, and an OpenAI LLM!
Configuring Canopy
Canopy streamlines the process of tokenizing our data, setting up an index in the vector database, pushing our data to the vector database, querying the DB, and prompting an LLM. Once you have specified your API keys from Pinecone and OpenAI in your environment, Canopy handles the rest, managing the connections to the remote services that host your vector database and your chosen LLM.
Note: If you don’t already have a Pinecone API key, you can sign up for a free key. This comes with a serverless vector database with 2GB of storage and access to the us-east-1 region.
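If you are following along in a notebook, one minimal way to expose both keys to Canopy is through environment variables (Canopy looks for PINECONE_API_KEY and OPENAI_API_KEY; the values below are placeholders to replace with your own keys):
import os

# Canopy reads these environment variables when it connects to Pinecone and OpenAI.
os.environ["PINECONE_API_KEY"] = "<your-pinecone-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"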
Canopy manages its configuration parameters via a YAML file, which can be created by a built-in tool. The AWS region may default to a different set of servers than what your license requires, so it is a good idea to create the configuration file yourself and make sure it points at the correct region. Do this by editing your ./config/default.yaml file in your code editor of choice and ensuring that us-east-1 is specified for the AWS region, as shown below:
create_index_params:
  # -----------------------------------------------------------------------
  # Initialization parameters to be passed to create a canopy index. These
  # parameters will be used when running "canopy new".
  # -----------------------------------------------------------------------
  metric: cosine
  spec:
    serverless:
      cloud: aws
      region: us-east-1
Obtain and Preprocess Data
We have used the Kaggle Dataset Amazon Shopping Reviews [updated daily] for our data. Major thanks to the Kaggle user Ashish Kumar for compiling and uploading this dataset.
The data consists of a CSV with columns for the text of the review, the user ID, the date and time that the review was posted, and some more metadata. In order to upsert the reviews into the vector database, we must ensure that they fit the Canopy Document class schema. Canopy Document containers expect the schema of the input text to follow this format:
id: str, unique identifier for the text
text: str, the text to insert into the database
source: str, where the text came from (i.e. URL)
metadata: dictionary, key-value pairs of other relevant information
One tip worth mentioning is that encoding as much of the string metadata as possible in meaningful numeric formats can make the retrieval process much easier. For example, we can extract the year and month of each review into their own metadata keys, which will streamline the RAG query process and ensure our LLM only needs to interact with the data we are interested in. Preprocessing and encoding your data in this way can be a superpower when building RAG pipelines!
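As a rough sketch of what this preprocessing might look like, the snippet below maps the review CSV onto the Document schema; the file name and column names (content, userName, score, thumbsUpCount, at) are assumptions based on our copy of the Kaggle dataset and may need adjusting for yours:
import uuid
import pandas as pd

# Hypothetical preprocessing sketch: map the review CSV onto the Canopy
# Document schema and pull the year and month out into numeric metadata.
raw_df = pd.read_csv("amazon_reviews.csv")  # assumed file name
raw_df["at"] = pd.to_datetime(raw_df["at"])

processed_df = pd.DataFrame(
    {
        "id": [str(uuid.uuid4()) for _ in range(len(raw_df))],
        "text": raw_df["content"].astype(str).tolist(),
        "source": raw_df["userName"].astype(str).tolist(),
        "metadata": [
            {
                "score": int(row["score"]),
                "thumbsUpCount": int(row["thumbsUpCount"]),
                "at": str(row["at"]),
                "year": int(row["at"].year),
                "month": int(row["at"].month),
            }
            for _, row in raw_df.iterrows()
        ],
    }
)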
Introducing the Knowledge Base
Canopy introduces a knowledge base that stores and retrieves text documents efficiently using a Pinecone index. It works by breaking each document into smaller segments according to its textual structure, such as Markdown or HTML. These segments are then converted into vectors using an embedding model, which can then be stored in the Pinecone index.
Once we have instantiated and populated the index in the vector database, we can submit a textual query to identify and return documents that are similar in this vector space.
It is a breeze to set up the knowledge base. We simply need to initialize the Tokenizer and call a single function, and Canopy handles the rest: creating a new Pinecone index, connecting to the knowledge base, setting up the schema, and preparing for document upload.
The KnowledgeBase method create_canopy_index performs much of the heavy lifting on our behalf, though we can help it by passing a spec using the ServerlessSpec container. This is how we let Canopy know that we want our index in the us-east-1 region, which matches our free Pinecone license.
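Putting those pieces together, the setup might look roughly like this; the index name is a placeholder, and the exact keyword arguments can vary slightly between Canopy versions:
from pinecone import ServerlessSpec
from canopy.tokenizer import Tokenizer
from canopy.knowledge_base import KnowledgeBase

# Initialize the global tokenizer Canopy uses for chunking and encoding text.
Tokenizer.initialize()

# "amazon-reviews" is a hypothetical index name for this walkthrough.
kb = KnowledgeBase(index_name="amazon-reviews")

# Create the serverless index in us-east-1 to match the free Pinecone plan,
# then connect to it before upserting or querying.
kb.create_canopy_index(spec=ServerlessSpec(cloud="aws", region="us-east-1"))
kb.connect()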
After creating the index, you can view it in the Pinecone console. Next, we move on to populating our vector database.
Upsert Data
Upserting data to a vector database can be a tricky and tedious process if done manually. The schema must be honored. The text must be chunked and tokenized consistently and accurately so that it can be encoded as vectors in the DB while preserving semantic meaning. And these mappings must be tracked and carried around any time you need to interact with the vector database.
Luckily, Canopy handles all of this for us. As long as we format our data as a list of the intuitive and easy-to-use Document containers, we can simply call the upsert method attached to our Knowledge Base, and Canopy handles the rest.
from canopy.models.data_models import Document
documents = [Document(**row) for _, row in processed_df.iterrows()]
This cell uses a list comprehension over the rows of the processed dataframe from above to create the Document containers that we need. The documents themselves can be viewed, and we can see that everything is formatted according to the schema that Canopy requires:
Document(id='12f4d71d-c57b-4fb5-8756-2c23d8c13441', text="This has always been a great app but now I can't change the number of items, I can't delete items or scroll down my saved items without it jumping back to the top. Very frustrating.", source='Alison Blackstone', metadata={'score': 3, 'thumbsUpCount': 0, 'reviewCreatedVersion': '28.10.0.100', 'at': '2024-05-19 20:25:46', 'appVersion': '28.10.0.100', 'year': 2024, 'month': 5})
Now we can populate our database! This process can take several minutes, depending on the number of records or text documents that need to be uploaded and their size. Since it takes a while, we can group the documents into batches and use tqdm to display a progress bar. On our machine, this took about 13 minutes to populate the 50,000 records in the vector database.
from tqdm.auto import tqdm

# Select a "reasonable" batch size; this is mostly for visual feedback on the
# upsert progress. However, choosing one that is too small will actually make
# the upsert take longer.
batch_size = 100

for i in tqdm(range(0, len(documents), batch_size)):
    kb.upsert(documents[i: i + batch_size])
Query the Knowledge Base
Just as in the case of formatting and uploading data, Canopy handles all the painful necessities of formatting queries so that we can focus on the interesting stuff, like how to get relevant data for our application.
Canopy provides a Query object, which gives us an interface for formatting a text query in the language that the vector database speaks, along with metadata filtering capabilities. For example, if we wish to find well-liked reviews from this year, we can do so with a query like this:
from canopy.models.data_models import Query

results = kb.query(
    [Query(
        text="buy",
        metadata_filter={
            "thumbsUpCount": {"$gt": 10},
            "year": {"$gte": 2024}
        }
    )]
)
Another useful aspect of the query functionality in the Canopy knowledge base is the ability to filter responses based on the metadata. Here, we’re considering only those reviews that have more than 10 likes from other users and are from 2024 or later. In our thought experiment, this could be a way for a developer to search for reviews that correspond to a recent release of the app and which are issues that appear to affect many users.
The API for comparisons in the metadata uses special key phrases, which are passed to Pinecone’s underlying query engine. “$gt” means greater than, and “$gte” stands for greater than or equal to some specified value. In English, "year": {"$gte": 2024} means find reviews from 2024 or later.
A few other useful operators include “$eq” (equal to), “$lt” (less than), and “$lte” (less than or equal to). Extracting the month and year, as we did in the Obtain and Preprocess Data section, allows us to filter in this way. Thinking about our data and application in this way helps perform targeted RAG queries and ultimately produces better results with the LLM.
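For instance, if we wanted to zero in on poorly rated reviews from a single month, a filter along these lines would do the trick (a sketch reusing the same operator syntax, with score being the star-rating field from our metadata):
# Only reviews from May 2024 with a star rating below 3.
metadata_filter = {
    "year": {"$eq": 2024},
    "month": {"$eq": 5},
    "score": {"$lt": 3},
}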
How to Chat With Our Data
Now for the fun part – let’s set up a Context Engine and Chat Engine and start talking to our data!
ContextEngine
The ContextEngine provides context to the LLM based on a desired set of search queries. It is cognizant of token usage, making it possible to constrain the queries to a token length that matches your budgetary and model needs. The ContextEngine works by querying the Knowledge Base for relevant documents according to the text specified by the ChatEngine (see below) and metadata filters that the user specifies. This context is then built and injected into the LLM.
Again, all of this functionality is provided under the hood, allowing us to focus on our final application and talking to our relevant data.
from canopy.context_engine import ContextEngine

context_engine = ContextEngine(
    kb,
    global_metadata_filter={
        "thumbsUpCount": {"$gt": 10},
        "year": {"$gte": 2024}
    }
)
ChatEngine
Now for the exciting part: the final piece in building an end-to-end chat agent that can talk to our dataset! Canopy makes it a breeze by providing ChatEngine, a batteries-included RAG-based chatbot. We will wrap GPT-3.5 from OpenAI while enhancing its performance with our carefully filtered, relevant results from our vector database.
With Canopy, we get automatically phrased search queries from the chat history, which are sent to the knowledge base, saving us time and energy.
from canopy.chat_engine import ChatEngine
chat_engine = ChatEngine(context_engine)
It is that easy to configure the chat engine.
We can make a programmatic chatbot in Python by using the instance of the ChatEngine that we created and providing a way for the function to remember our chat history when updating its response. We do this by storing the memory of the chat in a list and passing it to a utility function, which we define below:
from typing import Tuple
from canopy.models.data_models import Messages, UserMessage, AssistantMessage

def chat(new_message: str, history: Messages) -> Tuple[str, Messages]:
    messages = history + [UserMessage(content=new_message)]
    response = chat_engine.chat(messages)
    assistant_response = response.choices[0].message.content
    return assistant_response, messages + [AssistantMessage(content=assistant_response)]
Now, all we have to do is specify a chat prompt and use our RAG-guided queries to help us understand what our users have been experiencing this year with our app!
from IPython.display import display, Markdown
history = []
response, history = chat("What are users saying about the app?", history)
display(Markdown(response))
Users have expressed a variety of opinions and feedback about the app. Some users find the app slow, cluttered, and inefficient, with issues like freezing, crashing, and difficulty in viewing reviews (Source: Tony Chun, akumarie K, Steph Stamm). Others mention concerns about irrelevant AI features, excessive ads, and misleading reviews within the app (Source: P. B., Raymond Raymond).
Some users appreciate the app’s AI feature for summarizing reviews and find it helpful for purchases (Source: akumarie K). Overall, users have mixed opinions, with some praising the app for its reliability and deals (Source: Elianna Miller), but many point out issues like clutter, lack of usability, and bloating after updates (Source: James S, Zech Cowles, William).
The LLM can tell us that users have mixed opinions about the recent app updates and even cite its sources using the queried results from our vector database.
How about if we want the LLM to devise a set of app improvement projects based on user reviews?
chat_query = "Based on the reviews, please create a plan for improving the presentation"
chat_query += " and functionality of the app. "
chat_query += " Define three projects and give a short outline of work for each project."
chat_query += " Each project should be scoped in a focused way, with achievable results."
chat_query += " Provide specific, actionable tasks. Make estimates for the project in terms"
chat_query += " of time taken to complete the task, and rule out any that take longer than"
chat_query += " 12 weeks or require more types of workers than front end or back end engineers."
response, history = chat(chat_query, history)
display(Markdown(response))
And the results:
App Improvement Projects:
UI/UX Optimization Project:
Outline: Enhance the user interface and experience to make the app more visually appealing, intuitive, and optimized for performance.
Tasks:
Conduct a comprehensive UI/UX audit to identify pain points and areas for improvement (2 weeks).
Implement a flexible search bar that hides when scrolling down and reappears when needed (front end engineer, 4 weeks).
Introduce visual accessibility features like different color schemes and text sizes within the app, not just reliant on phone settings (front end engineer, 3 weeks).
Implement a ‘saved for later’ search feature for easier navigation and quick access to saved items (back end engineer, 3 weeks).
Performance Enhancement Project:
Outline: Improve app performance by addressing slow loading times, crashing issues, and clunky interface elements.
Tasks:
Conduct a performance analysis to identify and address bottlenecks causing lag and slow loading (2 weeks).
Optimize video playback functionality on product pages to ensure smooth performance (back end engineer, 4 weeks).
Address UI elements and animations jitteriness during video playback by optimizing code (front end engineer, 3 weeks).
Refactor and optimize codebase for overall app speed and efficiency (back end engineer, 4 weeks).
Feature Expansion Project:
Outline: Enhance app functionality by adding features like dark mode, recommended items page, and improved list organization.
Tasks:
Introduce a dark mode option for users to reduce eye strain and improve user experience (front end engineer, 3 weeks).
Develop a ‘Recommended’ page that suggests items based on user’s browsing and purchase history (back end engineer, 4 weeks).
Implement advanced list organization options allowing users to select multiple items and assign them to specific lists (back end engineer, 3 weeks).
Conduct user testing for feedback on new features and make iterative improvements based on user responses (2 weeks).
By focusing on these projects, the app can significantly enhance its presentation, functionality, and overall user experience within a reasonable timeframe and resource allocation.
Conclusion
Utilizing Pinecone, Canopy, and an LLM can revolutionize the way you approach data analysis by transforming the overwhelming task of analyzing vast amounts of textual data into a manageable and productive process. Whether you’re dealing with app reviews, customer feedback, or any other corpus of written documents, this innovative solution makes it easier than ever to extract and act upon critical insights.
If you’d like help implementing this solution for your organization, the experts at phData can help! Reach out to us today or attend one of our free Generative AI Workshops for honest advice and expert help managing unstructured text better.