Natural language understanding: Create your own language model

Oct. 31, 2023   

This is part of a four-part series highlighting work selected for presentation by Walmart technologists at the 2023 Grace Hopper Conference. This article includes insights from Walmart NLP experts Brenda Zhang, Dikla Karty, Setu Shah and Deepa Mohan.

Technology today is not just advancing; it is accelerating. Nowhere has this been more evident than in the seemingly overnight adoption of ChatGPT, a chatbot built on a Large Language Model (LLM) that can answer complex queries in a matter of seconds. LLMs are a game changer in the world of artificial intelligence, as well as in marketing. Marketers are finding immense value in using advanced Natural Language Processing (NLP) techniques for a myriad of applications. But what is NLP, really? And how can marketers use this technology?

NLP and Advanced LLMs

Natural Language Processing, at its core, is the intersection of computational linguistics and artificial intelligence. It is not just a matter of parsing sentences or responding to queries like a search engine, but of answering complex questions and comprehending context. Imagine teaching a machine the intricacies of the English language: idioms, sarcasm, cultural nuances and our ever-evolving vocabulary. That is where advanced models come in.

LLMs are built on a neural network architecture called the ‘Transformer’, which processes each word in relation to all the other words in a sentence rather than strictly one after another. This enables the model to capture context more effectively.
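To make the idea of relating every word to every other word more concrete, here is a minimal sketch of the attention calculation at the heart of this architecture. It is an illustration under simplifying assumptions, not a production implementation: the word vectors are made up, the function names are hypothetical, and the vectors are used directly as queries, keys and values, whereas real Transformers first pass them through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: every word looks at every word at once.

    Assumption: the word vectors serve directly as queries, keys and values.
    Real Transformers first apply learned projection matrices to each.
    """
    dim = len(vectors[0])
    outputs, weight_rows = [], []
    for query in vectors:
        # Score this word against all words in the sentence, itself included
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: a weighted blend of every word's vector
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(blended)
        weight_rows.append(weights)
    return outputs, weight_rows

# Made-up 2-dimensional vectors for a three-word sentence (illustrative only)
words = ["river", "bank", "flooded"]
vectors = [[1.0, 0.2], [0.9, 0.4], [0.1, 1.0]]

outputs, weight_rows = self_attention(vectors)
for word, weights in zip(words, weight_rows):
    print(word, [round(w, 2) for w in weights])
```

Each printed row shows how much weight one word gives to every word in the sentence. No word has to wait for the ones before it to be processed, which is the contrast with sequential models.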

LLMs are ‘pre-trained’, meaning that the model is initially trained on vast amounts of text from the internet, learning grammar, facts about the world and some reasoning abilities. However, it is the ‘fine-tuning’ on specific datasets that allows LLMs to excel at particular tasks.

Preprocessing natural language data

To fine-tune an LLM, the dataset must be structured in a way that is comprehensible to the model. Here’s how that can be done:

[Infographic: the steps for getting text ready for analysis, starting with integrating data from multiple sources, followed by tokenization, lowercasing, removing stop words and, finally, stemming or lemmatization.]
  • Tokenization: Break down text into chunks, or tokens. These could be as short as characters or as long as words
  • Lowercasing: Convert all characters in your text to lowercase to maintain consistency
  • Removing ‘stop’ words: Words like ‘and’, ‘the’ and ‘is’ can often be filtered out as they occur frequently but carry little semantic weight
  • Stemming/Lemmatization: This involves reducing words to their base or root form. For example, ‘running’ becomes ‘run’
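
The four steps above can be sketched in a few lines of Python. This is a simplified illustration rather than a production pipeline: the stop-word list is deliberately tiny, and `toy_stem` is a hypothetical, crude stand-in for a real stemmer or lemmatizer from an NLP library.

```python
import re

# A deliberately tiny stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "on", "of", "to"}

def tokenize(text):
    """Split text into word tokens (runs of letters, digits or apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def toy_stem(word):
    """Crude suffix stripping for illustration only, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            break
    return word

def preprocess(text):
    tokens = tokenize(text)                               # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop-word removal
    return [toy_stem(t) for t in tokens]                  # 4. stemming

result = preprocess("The customer is running in the new shoes")
print(result)  # ['customer', 'run', 'new', 'shoe']
```

In practice, an established stemmer or lemmatizer handles the many irregular cases that this toy version ignores.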

Fine-tuning an LLM for specific tasks

While LLMs come pre-trained, fine-tuning them on specific datasets allows for specialized applications. For marketers, this could mean tuning a model on customer reviews for sentiment analysis or on product descriptions for automated content generation.

Fine-tuning does not mean starting from scratch. It starts with the knowledge the language model already has; then, as the model is exposed to a new dataset, it adjusts its weights and biases to better predict the next word in your specific context.
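
One way to picture this is with a toy next-word predictor. The sketch below is an illustration only, under loud assumptions: a tiny word-pair model with a single table of weights (no biases) stands in for an LLM, both text samples are made up, and all function names are hypothetical. The point it shows is that fine-tuning continues training from the weights the model already has.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def word_pairs(text):
    """Return (previous word, next word) pairs from a stream of text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def average_loss(weights, index, pairs):
    """Mean cross-entropy loss for predicting each next word."""
    total = 0.0
    for prev, nxt in pairs:
        probs = softmax(weights[index[prev]])
        total -= math.log(probs[index[nxt]])
    return total / len(pairs)

def train(weights, index, pairs, epochs, learning_rate):
    """Adjust the weights in place with plain gradient descent."""
    for _ in range(epochs):
        for prev, nxt in pairs:
            row = weights[index[prev]]
            probs = softmax(row)
            for j in range(len(row)):
                target = 1.0 if j == index[nxt] else 0.0
                # Gradient of cross-entropy with respect to each score
                row[j] -= learning_rate * (probs[j] - target)

# Made-up text: a "general" sample and a "domain-specific" sample
general_text = "the store is open the store is big the price is low"
domain_text = "the delivery is late the delivery is slow the price is high"

vocab = sorted(set((general_text + " " + domain_text).split()))
index = {word: i for i, word in enumerate(vocab)}
weights = [[0.0] * len(vocab) for _ in vocab]

# "Pre-training": learn from the general text first
train(weights, index, word_pairs(general_text), epochs=50, learning_rate=0.5)
loss_before = average_loss(weights, index, word_pairs(domain_text))

# "Fine-tuning": keep the same weights and continue on the domain text
train(weights, index, word_pairs(domain_text), epochs=30, learning_rate=0.5)
loss_after = average_loss(weights, index, word_pairs(domain_text))

print(f"loss on domain text before fine-tuning: {loss_before:.3f}")
print(f"loss on domain text after fine-tuning:  {loss_after:.3f}")
```

The weights are never reset between the two training calls, so the second call only nudges what was already learned toward the new context.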

Evaluation is also an ongoing, iterative process. In other words, as the model learns, so should the human. Those evaluating the model should look at performance metrics such as ‘loss’, which measures how far the model’s predictions fall from the actual outcomes (the lower, the better). Practical tests might include tasks like generating ad copy or responding to customer queries. The accuracy, relevance and fluency of the outputs provide a real-world measure of performance.
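
As a rough illustration of what ‘loss’ measures, the sketch below computes cross-entropy, a loss commonly used for language models, over two sets of made-up predictions. The probabilities and the function name are assumptions chosen for the example, not output from a real model.

```python
import math

def cross_entropy(predicted_probs, actual_words):
    """Average negative log-probability given to the words that actually occurred."""
    total = 0.0
    for probs, actual in zip(predicted_probs, actual_words):
        total -= math.log(probs[actual])
    return total / len(actual_words)

# The words that really came next in the text
actual_words = ["great", "value"]

# Made-up predictions from two hypothetical models
confident_model = [{"great": 0.9, "poor": 0.1}, {"value": 0.8, "service": 0.2}]
unsure_model = [{"great": 0.5, "poor": 0.5}, {"value": 0.5, "service": 0.5}]

loss_confident = cross_entropy(confident_model, actual_words)
loss_unsure = cross_entropy(unsure_model, actual_words)

print(f"confident model loss: {loss_confident:.3f}")
print(f"unsure model loss:    {loss_unsure:.3f}")
```

The model that puts more probability on the words that actually appeared gets the lower loss. A number like this is a useful signal during training, but it does not replace the practical tests on real tasks.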

For marketers, NLP, and large language models in particular, points to a future yet to be fully realized. From automated content generation to sentiment analysis, the applications are vast. Creating a language model might seem complex, but with the right tools and knowledge, it is well within reach.