Finding Company Details through Large Language Model (LLM)-Powered Assistant Using LangChain, ChromaDB, and Retrieval-Augmented Generation (RAG)

By John Ivan Diaz

This project is a simple Large Language Model (LLM)-powered chatbot that helps users find information about companies. It uses the People Data Labs 2019 Global Company Dataset from Kaggle, the SentenceTransformers All-MiniLM-L6-v2 embedding model, and the Meta Llama 3.2 1B Instruct LLM from Hugging Face. The chatbot is deliberately simple and indexes only a small fraction of the dataset; its goal is to demonstrate LangChain for LLM orchestration, ChromaDB for vector storage, and Retrieval-Augmented Generation (RAG) for grounding answers in retrieved data.

Large Language Models (LLMs), LangChain, ChromaDB, Retrieval-Augmented Generation (RAG), Prompt Engineering

See source code

Discovery Phase

Use Case Definition

Natural Language Processing (NLP) enables computers to process and interpret the meaning of text. Traditional NLP relies on rule-based methods and statistical models, which are useful but often fall short on flexibility and accuracy. Large Language Models (LLMs), built on deep learning and the Transformer architecture, enable more natural and context-aware understanding.

This project aims to create a simple chatbot that helps users find information about companies using Large Language Models. It accepts questions from users, retrieves context from a dataset containing company information, and responds with answers grounded in the retrieved information. The goal is to keep the chatbot simple while demonstrating the LangChain framework, the ChromaDB vector database, and the Retrieval-Augmented Generation (RAG) technique.

Data Exploration

The project used the People Data Labs 2019 Global Company Dataset from Kaggle. It is a CSV file of over 7 million company records. Each record includes the company name, domain, year founded, industry, size range, locality, country, LinkedIn URL, and current and total employee estimates.

Sample row data:

name: IBM  
domain: ibm.com  
year founded: 1911  
industry: Information Technology and Services  
size range: 10001+  
locality: New York, New York, United States  
country: United States  
linkedin url: linkedin.com/company/ibm  
current employee estimate: 274,047  
total employee estimate: 716,906

For a lightweight demonstration, the project ingested only 100,000 rows of the dataset.
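
One way to cap the number of rows read and to turn each record into a line of text suitable for embedding is sketched below. This is a minimal illustration using only the Python standard library. The function names load_rows and row_to_text, the "field: value" text layout, and the column headers (taken from the sample row above) are assumptions, not the project's actual code.

```python
import csv
import io
from itertools import islice


def load_rows(csv_file, limit=100_000):
    """Read at most `limit` records from an open CSV file as dicts."""
    reader = csv.DictReader(csv_file)
    return list(islice(reader, limit))


def row_to_text(row):
    """Render one record as 'field: value' pairs, skipping empty fields."""
    return "; ".join(f"{key}: {value}" for key, value in row.items() if value)


# Tiny in-memory stand-in for the Kaggle CSV (column names are assumed).
sample_csv = io.StringIO(
    "name,domain,year founded,industry\n"
    "IBM,ibm.com,1911,Information Technology and Services\n"
    "Tencent,tencent.com,1998,Internet\n"
)

# limit=1 here only to show that the cap is respected.
rows = load_rows(sample_csv, limit=1)
documents = [row_to_text(row) for row in rows]
print(documents[0])
```

Because islice stops reading once the limit is reached, the remaining millions of rows are never parsed.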

Architecture and Algorithm Selection

The project used the SentenceTransformers All-MiniLM-L6-v2 embedding model and the Meta Llama 3.2 1B Instruct LLM from Hugging Face.

Development Phase

Data Pipeline Creation

Dataset Ingestion

The dataset is preprocessed using a text splitter that breaks long text into smaller chunks. Each chunk is tokenized, normalized, and converted into an embedding by the embedding model. The embeddings are stored in ChromaDB.
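
The ingestion steps can be sketched without any external libraries, as below. This is an illustration of the idea only: split_text is a plain character splitter standing in for LangChain's text splitter, toy_embed is a hypothetical hashed bag-of-words function standing in for All-MiniLM-L6-v2, and a Python list stands in for ChromaDB. The chunk size and overlap values are also assumptions.

```python
import math
import re
import zlib


def split_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text, dims=64):
    """Hypothetical stand-in for the embedding model: lowercase and
    tokenize the text, hash tokens into a count vector, L2-normalize."""
    vector = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vector[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def ingest(documents, embed, store, chunk_size=200, overlap=20):
    """Chunk each document, embed each chunk, and append both to `store`."""
    for doc_id, document in enumerate(documents):
        for chunk in split_text(document, chunk_size, overlap):
            store.append(
                {"doc_id": doc_id, "text": chunk, "embedding": embed(chunk)}
            )
    return store


store = ingest(
    ["name: IBM; domain: ibm.com; year founded: 1911; "
     "industry: Information Technology and Services"],
    toy_embed,
    store=[],
    chunk_size=40,
    overlap=10,
)
print(len(store), "chunks stored")
```

Keeping the chunk text next to its embedding matters because retrieval later returns the text, not the vector.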

Inference

The user types a question in the Gradio GUI. The question is tokenized, normalized, and converted into an embedding using the same embedding model. ChromaDB compares this query embedding with the stored dataset embeddings through a vector similarity search and returns the most relevant texts. The retrieved texts are passed, together with the question, to the large language model, which generates a natural language response grounded in them. This retrieve-then-generate flow is what Retrieval-Augmented Generation refers to. The response is displayed back to the user in the Gradio GUI.
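
The comparison between the query embedding and the stored embeddings can be illustrated with a small top-k search. The sketch below assumes cosine similarity as the metric; the metric actually configured in ChromaDB may differ. The three-dimensional vectors, the "Example Co" record, and the function names are made up for illustration and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# Hand-written toy vectors; a real embedding model produces far more dimensions.
store = [
    {"text": "name: IBM; locality: New York", "embedding": [0.9, 0.1, 0.0]},
    {"text": "name: Tencent; locality: Shenzhen", "embedding": [0.1, 0.9, 0.0]},
    {"text": "name: Example Co; locality: Elsewhere", "embedding": [0.0, 0.1, 0.9]},
]

query_vector = [0.2, 0.8, 0.1]  # pretend embedding of a question about Tencent
results = top_k(query_vector, store, k=2)
print(results)
```

A vector database performs the same ranking, but with index structures that avoid scoring every stored vector.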

Text splitting, embedding calls, and ChromaDB integration are handled using the LangChain framework.
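
The step in which the retrieved texts are handed to the language model amounts to assembling a prompt that contains both the context and the question. A minimal sketch follows. The function name build_prompt and the prompt wording are assumptions, and the project's actual prompt template may look different.

```python
def build_prompt(question, contexts):
    """Combine retrieved texts and the user question into one prompt
    that instructs the model to stay within the given context."""
    context_block = "\n".join(f"- {context}" for context in contexts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Where is Tencent based?",
    ["name: Tencent; locality: Shenzhen, Guangdong, China"],
)
print(prompt)
```

Telling the model to admit when the context lacks the answer is a common prompt engineering choice for reducing unsupported responses.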

Evaluation

Sample Prompts and Responses

Ground Truth

Tencent, based in Shenzhen, Guangdong, China:

name: Tencent  
domain: tencent.com  
year founded: 1998  
industry: Internet  
size range: 10001+  
locality: Shenzhen, Guangdong, China  
country: China  
linkedin url: linkedin.com/company/tencent  
current employee estimate: 37,574  
total employee estimate: 42,617
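
A response can be checked against a ground-truth record like the one above by testing whether each expected value appears in the response text. The sketch below is one simple way to do that. The function name missing_facts is hypothetical, and the response string is written by hand for illustration; it is not actual chatbot output.

```python
def missing_facts(response, ground_truth, fields):
    """Return the fields whose ground-truth value is absent from the response
    (case-insensitive substring match)."""
    text = response.lower()
    return [
        field for field in fields
        if str(ground_truth[field]).lower() not in text
    ]


ground_truth = {
    "name": "Tencent",
    "year founded": "1998",
    "industry": "Internet",
    "locality": "Shenzhen, Guangdong, China",
}

# Hand-written example response, NOT real model output.
response = "Tencent is an Internet company founded in 1998."

missing = missing_facts(
    response, ground_truth, ["name", "year founded", "industry", "locality"]
)
print(missing)
```

Substring matching is strict: a response that paraphrases a value, for example by writing only "Shenzhen", would still be reported as missing the locality.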

Presentation of Results

Guide for Local Testing

Download the dataset from Kaggle

Go to dataset

Create a folder named "data" and place the dataset in it

Clone the repository from GitHub

Go to repository

Create a virtual environment

python -m venv venv

Activate the virtual environment

source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Create a .env file and place your HuggingFace Access Token in it

HUGGINGFACEHUB_API_TOKEN=hf...

Note: The project uses meta-llama/Llama-3.2-1B. Make sure your Hugging Face account has been granted access to the model by the repository authors.

Ingest the dataset

python ingest-database.py

Run inference

python chatbot.py