Create ChatGPT for Your University
So, let’s get started!
Step 1: Data Transformation
To begin, producing clean, meaningful text is crucial. Simply stripping HTML tags from scraped pages does not always result in clean and relevant text.
For instance, take a look at one of the teacher profiles at my institute:
If a teacher profile is scraped in the format “Dr. Nasir A. Afghan work experience: Assistant Professor…”, the chatbot will not be able to answer accurately when asked about that teacher’s work experience within a large dataset. To avoid such issues, the information must be presented in sentences that mimic human communication, such as “Dr. Nasir A. Afghan has experience as an Assistant Professor…”.
Therefore, it is vital to transform the scraped data into a format that is easily understood by humans.
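To make this concrete, here is a minimal Python sketch of such a transformation. The field names and the profile_to_sentences helper are illustrative assumptions, not the exact code from my project:

```python
# A sketch of Step 1: turning scraped key-value fields into
# human-readable sentences. Field names are hypothetical.

def profile_to_sentences(profile: dict) -> str:
    """Convert a scraped teacher profile into natural-language text."""
    sentences = []
    name = profile.get("name", "The teacher")
    if "work_experience" in profile:
        sentences.append(f"{name} has experience as {profile['work_experience']}.")
    if "books" in profile:
        books = ", ".join(profile["books"])
        sentences.append(f"{name} has written the following books: {books}.")
    return " ".join(sentences)

scraped = {
    "name": "Dr. Nasir A. Afghan",
    "work_experience": "an Assistant Professor",
}
print(profile_to_sentences(scraped))
# Dr. Nasir A. Afghan has experience as an Assistant Professor.
```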
Step 2: Embedding the Data
It may seem logical to train the model on the entire dataset at once, but this is impractical and not a wise decision. Particularly when building chatbots for specific domains, such as universities or companies, it is better to use OpenAI's embedding technique. This approach finds the information most relevant to the asked question and then answers based on that information. It is more powerful and far less expensive than training the model on the entire dataset. Embedding works by breaking the dataset into smaller chunks, each of which is converted into a vector.
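As a sketch, embedding a single chunk with OpenAI's embeddings endpoint (openai Python package, v1+) looks roughly like this; the model name is one common choice, and OPENAI_API_KEY is assumed to be set in the environment:

```python
# Embedding one text chunk with the OpenAI embeddings endpoint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunk = "Dr. Nasir A. Afghan has experience as an Assistant Professor."
resp = client.embeddings.create(model="text-embedding-ada-002", input=chunk)
vector = resp.data[0].embedding  # a list of floats representing the chunk
print(len(vector))  # 1536 dimensions for this model
```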
For example, in my dataset, consider the first few rows of information about a particular teacher. If the whole profile were embedded as one block, a search for relevant information would return the entire profile; breaking the data into chunks ensures that only the necessary, pertinent information is extracted.
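One simple chunking strategy is a fixed-size word window with some overlap; the window and overlap sizes below are assumptions you would tune for your own data:

```python
# Break a long profile into overlapping word-window chunks
# before embedding. Sizes are illustrative defaults.

def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks

profile_text = "Dr. Nasir A. Afghan has experience as an Assistant Professor..."
print(chunk_text(profile_text, max_words=10, overlap=2))
```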
Step 3: Working with GPT API
Suppose we ask the question “What books has Dr. Ashghar written?”
In this scenario, the algorithm uses cosine similarity to find the most relevant chunk, which in this case would be “Dr. Ashghar has written the following books…”. This chunk is then passed to the GPT prompt as context, along with the question itself, and the GPT model generates an answer.
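A rough sketch of this retrieve-then-answer step, assuming the chunks have already been embedded as in Step 2 (the system prompt and model choice are illustrative):

```python
# Rank embedded chunks by cosine similarity to the question,
# then pass the best match as context to a chat model.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, chunks: list[str], vectors: list[np.ndarray]) -> str:
    q_vec = embed(question)
    # Index of the chunk closest to the question in embedding space.
    best = max(range(len(chunks)), key=lambda i: cosine_similarity(q_vec, vectors[i]))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the given context."},
            {"role": "user", "content": f"Context: {chunks[best]}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```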
Step 4: Cost Calculation
New users of OpenAI receive $5 in free credit on signup. The embedding step is so cost-effective that a dataset on the order of a gigabyte can be embedded for less than $5. The real cost of the project, however, comes from generating answers: each question sends the retrieved context plus the question itself to the model, and you will need to ask many questions to evaluate the answers in a human-like manner. If you are working with a large dataset, it is essential to consider in advance the types of questions users are likely to ask. Embedding is a one-time cost; if the dataset grows, only the new data needs to be embedded. In my case, I embedded my entire university website for less than $5, but the remaining credit was only enough to ask a few questions.
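For a back-of-the-envelope estimate, you can multiply token counts by per-token prices. The rates and dataset size below are placeholders, so check OpenAI's current pricing page before relying on them:

```python
# Rough cost estimate. All numbers below are assumed, illustrative values.
EMBED_PRICE_PER_1K_TOKENS = 0.0001   # assumed embedding rate, USD
CHAT_PRICE_PER_1K_TOKENS = 0.002     # assumed chat-completion rate, USD

dataset_tokens = 5_000_000           # illustrative dataset size
embedding_cost = dataset_tokens / 1000 * EMBED_PRICE_PER_1K_TOKENS

tokens_per_question = 1_500          # question + retrieved context + answer
cost_per_question = tokens_per_question / 1000 * CHAT_PRICE_PER_1K_TOKENS

print(f"One-time embedding cost: ${embedding_cost:.2f}")
print(f"Cost per question:       ${cost_per_question:.4f}")
```

Note how the one-time embedding cost stays small while per-question costs accumulate with usage, which matches my experience above.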
Step 5: Creating a frontend
A frontend costs nothing until you get serious about it. One free option is Streamlit, which provides a free cloud platform to host your chatbot web app.
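A minimal Streamlit app can be just a few lines. This sketch assumes the answer() helper and the embedded chunks from Step 3 live in a hypothetical local module named qa.py:

```python
# Minimal Streamlit front end. Save as app.py and run:
#   streamlit run app.py
import streamlit as st

from qa import answer, chunks, vectors  # hypothetical local module from Step 3

st.title("University Chatbot")

question = st.text_input("Ask a question about the university:")
if question:
    with st.spinner("Thinking..."):
        st.write(answer(question, chunks, vectors))
```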
This is what my frontend app looks like; it is quite similar to ChatGPT.
The end.