Turn Text Into Structured Data Using JavaScript & OpenAI's GPT

Human languages are incredibly diverse and can describe an infinite number of things, but most computer programs cannot understand text, at least not without preprocessing.

You can do very useful things once you get structured data out of large pieces of text:

Perform statistical analyses
Analyse sentiments
Build product features, such as summarisation
Extract properties to store in your data warehouse

This used to be much harder before the era of LLMs, and required more sophisticated knowledge of Natural Language Processing techniques. GPT and other models have made analysing large texts really accessible. It is worth noting that results are not always reliable, but the same caveat holds true with many other text processing techniques.

Let's look at a real example that will show you how to convert unstructured data into a useful structured format.

Extract Best Dishes From Restaurant Reviews

Let's say you're working with a chain of pizzerias and they want to figure out the dish that customers love the most. They have thousands of customer reviews that they can draw insights from, but it would take ages for a human to read everything and report back the results.

This type of problem is solvable using NLP methods such as NER (Named Entity Recognition) or other forms of rule-based matching. LLMs can make these problems a lot easier to solve, because you only need to:

Write a good prompt.
Call an API with the prompt and the data you want processed.

Let's take it step by step.

Note: If you'd like to follow along, you can find 20 example reviews in this GitHub Gist. In our code we only have a few to keep things short.

const reviews = [
  {
    reviewText:
      "Absolutely loved the Margherita pizza here! The crust was just perfect.",
  },
  {
    reviewText:
      "The spaghetti carbonara was a bit too salty for my taste. Might try something else next time. The garlic bread was really good though.",
  }
  // ... find more reviews here: https://gist.github.com/arisp8/db7a2516e40ca9b84b18a6a19da36962
];

If you read through the reviews, you can see that some of them mention multiple dishes. Additionally, people might spell the same dish in many different ways (e.g. "Margherita pizza" vs. "Pizza Margarita" vs. "margarita").

Our goal is to find the most popular dish that people keep clamouring for, so we can feed that back into our report.

First, experiment with prompts

Let's break down what I'm doing here:

First, I'm providing some context so that GPT knows what I'm trying to achieve. I specify explicitly that scores should be negative, neutral, or positive.
I provide an example JSON showing how I want the data to look. This is important because otherwise you risk getting an inconsistent structure that wouldn't be easy to work with.
I note that the response should only include JSON, and no text or explanations.
And then I include the review text that I want to process.
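To make the four parts concrete, here is a minimal sketch of how such a prompt could be assembled in code. The wording is illustrative only and is not the exact prompt from my experiment, and buildPrompt is a hypothetical helper name.

```javascript
// Hypothetical helper: assembles a prompt from the four parts described above.
// The wording is illustrative and not the exact prompt from the experiment.
const buildPrompt = (reviewText) => {
  // 1. Context, including the allowed sentiment scores
  const context =
    "I am analysing pizzeria reviews. Extract the dishes mentioned in the review " +
    'and score each one as "negative", "neutral", or "positive".';
  // 2. An example of the JSON structure we expect back
  const exampleJson = JSON.stringify(
    { dishes: [{ name: "[dish name]", sentiment: "[sentiment]" }] },
    null,
    2
  );
  return [
    context,
    `Use this JSON format:\n${exampleJson}`,
    // 3. Ask for JSON only
    "Respond with JSON only, without any text or explanations.",
    // 4. The review to process
    `Review Text: ${reviewText}`,
  ].join("\n\n");
};

console.log(buildPrompt("Absolutely loved the Margherita pizza here!"));
```

Keeping the parts separate like this makes it easier to tweak one of them at a time while experimenting.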

And this is the result:

Now that I've found a prompt that works, I will use the OpenAI package inside Node.js to get the results programmatically.

Get results through the OpenAI API

I'll use the latest GPT-3.5 Turbo model, because it is one of the cheapest options while being generally reliable.

I'm starting a new Node project and I'll use the official OpenAI NPM package to get chat completions. (The full guide on how to use chat completions can be found here)

First, you'll need to import the openai package and set your API key.

import OpenAI from "openai";

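const openai = new OpenAI({
  // Read the key from an environment variable rather than hard-coding it,
  // so that it doesn't end up in version control
  apiKey: process.env.OPENAI_API_KEY,
});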

Now we need to create a system prompt. That is a set of instructions that explains what sort of output we're expecting.

const systemPrompt = `I am doing a large scale analysis of pizzeria reviews.
I want you to extract the dishes that customers mention in their reviews
and give them a score that is "negative", "neutral", or "positive"
depending on whether customers liked the dish or not. Dishes must be
things that you expect to see in a menu (Pizza, Pasta) but not things
like "water", "napkins", "crust" or specific attributes of the dishes (flavour, sauce, etc.).

Use this format for the resulting JSON object:
{
  "dishes": [
      {
          "name": "[dish name]",
          "sentiment": "[sentiment]"
      }
  ]
}
`;

I took the prompt that I created earlier, and I tweaked certain things to increase the chance of getting the right results. As an example, you can see that I explained that "water" or "crust" are not considered dishes, so that we don't get sentiment scores for these.

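Next, I defined a function called analyseReview that accepts a review text and returns the results as a parsed JSON object. I used the response_format property in the API call, so that the model is constrained to produce valid JSON. One caveat: if the response hits the token limit, the output can be cut off and will fail to parse.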

const analyseReview = async (reviewText) => {
  const chatCompletion = await openai.chat.completions.create({
    // 💡 Important: By setting the response_format to json_object
    // we ask the API for a response that is valid JSON. The output can
    // still be cut off if the response hits the token limit.
    response_format: { type: "json_object" },
    // This model is very cheap and accepts up to 16k tokens
    model: "gpt-3.5-turbo-0125",
    messages: [
      {
        role: "system",
        // Here we pass the system prompt
        content: systemPrompt,
      },
      {
        role: "user",
        // And here we pass the review text
        content: `Review Text: ${reviewText}`,
      },
    ],
  });
  // And finally we parse the JSON string returned by the OpenAI API
  return JSON.parse(chatCompletion.choices[0].message.content);
};
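Even with JSON mode, the shape of the response isn't checked for you: the model could return an unexpected sentiment or omit the dishes array. Below is a minimal sketch of a defensive parser. parseAnalysis and ALLOWED_SENTIMENTS are hypothetical names that I'm introducing for illustration, and returning an empty list on failure is just one possible policy.

```javascript
const ALLOWED_SENTIMENTS = ["positive", "neutral", "negative"];

// Hypothetical helper: parses the raw model output and keeps only well-formed dishes
const parseAnalysis = (rawContent) => {
  let parsed;
  try {
    parsed = JSON.parse(rawContent);
  } catch {
    // Truncated or otherwise invalid JSON: treat it as "no dishes found"
    return { dishes: [] };
  }
  // Fall back to an empty list if "dishes" is missing or not an array
  const dishes = Array.isArray(parsed?.dishes) ? parsed.dishes : [];
  return {
    dishes: dishes.filter(
      (dish) =>
        typeof dish?.name === "string" &&
        dish.name.trim() !== "" &&
        ALLOWED_SENTIMENTS.includes(dish?.sentiment)
    ),
  };
};

console.log(
  parseAnalysis('{"dishes":[{"name":"Garlic bread","sentiment":"positive"}]}')
);
```

To use it, analyseReview could return parseAnalysis(chatCompletion.choices[0].message.content) instead of calling JSON.parse directly.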

And now for the final step: I loop through all the reviews and use an object to aggregate the results. This should give us an idea of how many people like each dish.

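async function main() {
  // Maps each lower-cased dish name to its sentiment counts
  const result = {};
  for (const review of reviews) {
    const analysis = await analyseReview(review.reviewText);
    // Guard against responses that are missing the "dishes" array
    for (const dish of analysis.dishes ?? []) {
      const key = dish.name.toLowerCase();
      if (!result[key]) {
        result[key] = { positive: 0, neutral: 0, negative: 0 };
      }
      // Only count the three expected sentiments, so that an unexpected
      // value cannot add a NaN entry to the counts
      if (Object.hasOwn(result[key], dish.sentiment)) {
        result[key][dish.sentiment]++;
      }
    }
  }
  console.log(result);
}

// Log API or parsing errors instead of leaving the rejection unhandled
main().catch(console.error);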

I tried running it with a small sample of reviews, and the most liked dish is the Mushroom Pizza! It's impressive how a few lines of code can get you this result, without depending on traditional NLP methods.
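If you'd rather not read the winner off the logged object by eye, a small helper can rank the dishes. This is a sketch: rankDishes is a hypothetical name, the numbers in sampleResult are made up for illustration, and ranking purely by positive mentions is only one possible definition of "most liked".

```javascript
// Sample aggregate in the same shape that main() logs (the numbers are made up)
const sampleResult = {
  "mushroom pizza": { positive: 4, neutral: 1, negative: 0 },
  "margherita pizza": { positive: 3, neutral: 0, negative: 1 },
  "spaghetti carbonara": { positive: 1, neutral: 0, negative: 2 },
};

// Hypothetical helper: sorts dishes by positive mentions, highest first
const rankDishes = (result) =>
  Object.entries(result)
    .map(([name, counts]) => ({ name, ...counts }))
    .sort((a, b) => b.positive - a.positive);

console.log(rankDishes(sampleResult)[0].name); // prints "mushroom pizza"
```

You could also rank by positive minus negative mentions, so that a dish with many complaints doesn't come out on top.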

There are obviously various improvements that you can add to this approach - this is meant purely as a proof of concept and an idea that you can take with you if you encounter a problem of this nature.

One issue that I noticed is that "margherita pizza" and "margarita" are treated as two separate dishes, even though they're the same. To improve the accuracy of the results, you can improve upon the systemPrompt by specifying all the menu items and instructing GPT to match different spellings for each of the dishes.
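Here is a rough sketch of that idea. The menu below is hypothetical, withMenu is a helper name that I made up, and the instruction wording is just a starting point that you would need to test against your own reviews.

```javascript
// Hypothetical menu: replace with the restaurant's real menu items
const menuItems = [
  "Margherita Pizza",
  "Mushroom Pizza",
  "Spaghetti Carbonara",
  "Garlic Bread",
];

// Hypothetical helper: appends a menu section to an existing system prompt
const withMenu = (basePrompt, items) =>
  `${basePrompt.trim()}

Only use dish names from this menu, and map misspellings or shortened names
(for example "margarita") to the closest menu item:
${items.map((item) => `- ${item}`).join("\n")}`;

console.log(
  withMenu("I am doing a large scale analysis of pizzeria reviews.", menuItems)
);
```

You would then pass withMenu(systemPrompt, menuItems) as the content of the system message instead of systemPrompt.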

At the end of the day, this is a process that requires trial and error, and the example is by no means one-size-fits-all. I suggest that you try different prompts until you find the one that works best for your specific case.

Conclusion

You can now use the methods above to extract structured data from any piece of text, and use the data for subsequent analyses. There are tons of other examples where you can use this approach to build cool features more easily than before:

Ask GPT to extract the main talking points from a conversation or transcript.
Do SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis based on a company's website content.
Classify media articles based on various attributes that you're interested in.

If you're interested in trying this out, you can find a complete code example on GitHub.