Fine-tuning a Large Language Model (LLM) on domain-specific data can transform a general AI assistant into an expert in that domain. In this two-part series, we'll walk through how to prepare a high-quality nutrition Q&A dataset from a collection of nutrition books (provided as PDFs/EPUBs) and then fine-tune an LLM to become a proficient nutrition expert. This first part covers dataset preparation – extracting text from the PDFs, chunking it by topic, formulating question-answer pairs, and packaging everything into the required JSONL format. (Part 2 will cover the actual fine-tuning process using the prepared dataset and mlx_lm commands.)
General-purpose LLMs are trained on broad internet text and may lack depth in specialized areas. By fine-tuning on expert-written nutrition books, we can imbue the model with detailed knowledge about dietary principles, healthy recipes, and food habits. The goal is to have the model confidently answer nutrition questions, suggest recipes, and create diet plans with the nuance and accuracy of a human nutritionist. Fine-tuning with domain-specific data has been shown to improve an LLM's performance on related tasks. In our case, we want the AI to internalize nutrition expertise – from explaining macronutrients to outlining balanced meal plans – all grounded in reputable sources.
Key benefits of domain fine-tuning:
- Deeper, more accurate answers on nutrition topics than a general-purpose model can provide.
- Responses grounded in reputable, expert-written sources rather than miscellaneous web text.
- A consistent expert persona that can explain concepts, suggest recipes, and build meal plans.
By the end of this process, we expect our fine-tuned model to handle queries like "What are some high-protein breakfast options for vegetarians?" or "Can you suggest a 1500-calorie daily meal plan?" with ease and reliability.
We have a set of nutrition books (in PDF and EPUB formats) that will serve as our knowledge base. These include titles like In Defense of Food by Michael Pollan, Metabolical by Robert Lustig, Nourishing Traditions by Sally Fallon, and Whole: Rethinking the Science of Nutrition by T. Colin Campbell. These books cover a range of topics:
- whole foods versus processed foods and how to choose what to eat
- the fundamentals of nutrition science (macronutrients, vitamins, and minerals)
- traditional cooking techniques and recipes
- everyday eating habits and dietary patterns
Why use books? Books by established authors are rich in well-structured information and often provide deeper insight than random web articles. By using multiple books, we gather a diverse set of perspectives and tips, making the fine-tuning dataset more robust. The combination of these sources will help the model not only recite facts but also offer practical advice (like recipes and habits) in a well-informed manner.
Before we can use this content, we need to extract the text from the PDF/EPUB files.
To convert the knowledge in these books into a usable text format, we'll use Python libraries for PDF and EPUB processing. Two popular options for PDF text extraction are PyMuPDF and pdfminer.six (PDFMiner). We'll focus on PyMuPDF (which is imported as fitz) because it's fast and straightforward to use – in fact, many developers consider PyMuPDF one of the best libraries for this purpose. We should ensure our PDFs are searchable text PDFs (not just scanned images of pages), otherwise no text extractor will work.
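As a quick sanity check, you can probe whether a PDF actually has an extractable text layer by sampling a few pages and seeing whether get_text returns anything. Here's a minimal sketch (the has_text_layer helper is our own, not part of PyMuPDF):

import fitz  # PyMuPDF

def has_text_layer(pdf_path, sample_pages=5):
    """Return True if any of the first few pages yield non-empty text."""
    doc = fitz.open(pdf_path)
    try:
        for page_number in range(min(sample_pages, doc.page_count)):
            if doc[page_number].get_text("text").strip():
                return True
        return False
    finally:
        doc.close()

if not has_text_layer("Michael Pollan - In Defense of Food.pdf"):
    print("Warning: this PDF looks like scanned images; OCR would be needed before extraction.")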
Steps to extract text from a PDF using PyMuPDF:
1. Import fitz. Install via pip if needed (pip install pymupdf).
2. Open the PDF with fitz.open("path/to/file.pdf"). This returns a Document object.
3. Iterate over the pages, either with doc.pages() or with for page_number in range(doc.page_count):.
4. Extract each page's text with page.get_text("text"). This yields the page's text content.

Here's a code snippet demonstrating PDF text extraction:
import fitz  # PyMuPDF

pdf_path = "Michael Pollan - In Defense of Food.pdf"
doc = fitz.open(pdf_path)
all_text = ""
for page in doc:  # iterate over pages
    page_text = page.get_text("text")
    all_text += page_text
doc.close()
print(f"Extracted {len(all_text)} characters of text.")
This will read the entire PDF and store the text in the all_text variable. If the PDF has a large number of pages, this could be a lot of text (potentially tens or hundreds of thousands of characters).
For EPUB files, the process is a bit different since EPUBs are essentially zip archives of HTML files. You can use libraries like ebooklib or convert EPUB to PDF/plain text. For simplicity, you might convert the EPUBs to PDF first (using an external tool) or use an EPUB-specific parser to extract text. The key goal is to get a raw text dump of each book.
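If you'd rather stay in Python, here is a minimal sketch using ebooklib together with BeautifulSoup to strip the HTML markup (both are third-party packages, installable with pip install ebooklib beautifulsoup4); treat it as a starting point rather than a polished parser:

import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_epub_text(epub_path):
    """Concatenate the visible text of every HTML document inside the EPUB."""
    book = epub.read_epub(epub_path)
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        parts.append(soup.get_text(separator=" "))
    return "\n".join(parts)

text = extract_epub_text("path/to/book.epub")
print(f"Extracted {len(text)} characters from the EPUB.")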
Cleaning the text: Once extracted, the text may contain artifacts like page numbers, headers/footers, or hyphenation at line breaks, so we should clean these up. For example, we can strip out any lines that consist mostly of digits or are entirely in capitals (likely page numbers or section titles), although we should be careful not to remove legitimate acronyms. Another strategy is to use a regex to find and re-join words hyphenated across line breaks.
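As an illustration, here is a small cleaning sketch along those lines (the patterns and thresholds are our own choices and will likely need tuning for your particular books):

import re

def clean_page_text(raw: str) -> str:
    lines = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Drop lines that are only digits (likely page numbers).
        if stripped.isdigit():
            continue
        # Drop short all-caps lines (likely running headers), keeping longer prose.
        if stripped.isupper() and len(stripped.split()) <= 6:
            continue
        lines.append(stripped)
    text = " ".join(lines)
    # Re-join words hyphenated at line breaks, e.g. "carbo- hydrate" -> "carbohydrate".
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s{2,}", " ", text)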
Simply having a massive wall of text isn't ideal for fine-tuning. We need to break the content into meaningful chunks. Each chunk will later be turned into a Q&A pair. Chunking smartly by topic ensures each Q&A pair is coherent and focused.
We've identified four main topics to chunk by:
- Nutrition Basics – macronutrients, vitamins, minerals, and how the body uses them.
- Healthy Diets – dietary patterns and principles of healthy eating.
- Recipes – concrete recipes with ingredients and preparation steps.
- Good Food Habits – everyday habits, meal planning, and lifestyle advice.
How to chunk by topic? There are a few approaches:
- Use the books' own structure (chapters and section headings) to delimit chunks.
- Split the text into fixed-size pieces and assign each piece a topic with a keyword heuristic.
- Use an LLM to label or summarize chunks (more powerful, but slower and more expensive).
For our dataset, we will:
- naively split the cleaned text into chunks of roughly 1,000 characters, trimming each chunk back to the last sentence boundary;
- assign each chunk to one of the four topics by counting keyword occurrences (falling back to Nutrition Basics when nothing matches).
For example, from Nourishing Traditions we might get a Recipes chunk containing an ingredient list and preparation steps, while a passage from In Defense of Food about choosing whole foods would land in Healthy Diets.
Each chunk will later be transformed into a Q&A pair, so we want each chunk to be self-contained enough to answer a specific question.
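To make the keyword heuristic concrete, here is a compact sketch of the kind of classifier the full script below implements (classify_chunk and the abbreviated keyword lists here are for illustration only):

TOPIC_KEYWORDS = {
    "Nutrition Basics": ["nutrient", "vitamin", "protein", "carbohydrate", "fat"],
    "Healthy Diets": ["diet", "whole food", "vegetarian", "healthy eating"],
    "Recipes": ["ingredients", "recipe", "preheat", "chopped"],
    "Good Food Habits": ["habit", "moderation", "snacking", "meal plan"],
}

def classify_chunk(chunk: str) -> str:
    """Pick the topic whose keywords appear most often in the chunk."""
    text = chunk.lower()
    scores = {
        topic: sum(text.count(word) for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    best_topic, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_topic if best_score > 0 else "Nutrition Basics"  # fallback

print(classify_chunk("Ingredients: two cups of chopped kale, one diced onion..."))  # -> Recipes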
With topic-specific chunks in hand, the next step is to create a question that a user might ask and an answer derived from the chunk's content. Essentially, we're converting raw text into a conversational format suitable for instructing an LLM.
Here's how to approach Q&A creation:
- Write the question from the user's point of view: something a real person would plausibly ask about the chunk's topic (e.g., "What are some principles of a healthy diet?").
- Derive the answer from the chunk's content, rephrased or lightly trimmed into a clear, self-contained response.
- Attach a fixed system message ("You are a helpful nutritionist.") so every example carries the expert persona.
By formulating the Q&A this way, we are essentially creating a synthetic dialogue where the question triggers the information found in the chunk. It's a form of supervised training data that pairs an input (user's question) with the desired output (assistant's answer), under the context of a system role.
Example Q&A pair:

User: What are the benefits of eating fermented foods?

Assistant: Fermented foods like sauerkraut, kimchi, kefir, and yogurt offer several benefits. The fermentation process produces beneficial bacteria (probiotics) that support a healthy gut and aid digestion. It also preserves the food naturally and can make certain nutrients easier to absorb. Adding a small serving of fermented foods to your meals a few times a week is a simple way to support digestive health.
Here, the chunk might have been a section from a book explaining fermented foods. We turned it into a concise answer. Notice that the answer is informative yet concise, covering 3-4 key points – this is ideal for training an LLM to produce well-structured responses.
We will repeat this process for all chunks, ensuring a diverse set of Q&As. It's important to cover a wide range of questions (factual explanations, how-tos, advice, etc.) to make the fine-tuned model versatile in handling various user needs.
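One simple way to broaden coverage, beyond the single fixed question per topic used in the script below, is to keep several question templates per topic and sample among them. A small sketch (the QUESTION_TEMPLATES table and pick_question helper are our own illustration):

import random

QUESTION_TEMPLATES = {
    "Healthy Diets": [
        "What are some principles of a healthy diet?",
        "How should I structure my meals to eat more healthily?",
        "What does a balanced diet look like day to day?",
    ],
    "Good Food Habits": [
        "What are some good food habits I should follow?",
        "How can I build better everyday eating habits?",
    ],
}

def pick_question(topic: str) -> str:
    """Sample a question template so the dataset isn't dominated by one phrasing."""
    templates = QUESTION_TEMPLATES.get(topic, ["I have a nutrition question, can you help?"])
    return random.choice(templates)

print(pick_question("Healthy Diets"))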
Now that we have a collection of Q&A pairs, we need to format them properly for the fine-tuning process. We'll use a JSONL (JSON Lines) format, where each line is a JSON object representing one training example. The fine-tuning framework we plan to use (the mlx_lm library for LLM fine-tuning with LoRA) supports a chat format in the JSONL, meaning we can provide a list of message objects with roles.
The format for each line will be:
{
  "messages": [
    {"role": "system", "content": "You are a helpful nutritionist."},
    {"role": "user", "content": "<User's question here>"},
    {"role": "assistant", "content": "<Assistant's answer here>"}
  ]
}
This format follows the role-based chat convention (system/user/assistant). The MLX LLM fine-tuning tool recognizes this as a valid chat dataset entry. It's important to include the system role in every example so that the model consistently learns to respond with the perspective of a nutrition expert.
We will create three files:
- train.jsonl – the bulk of the data for training (e.g., 80% of the Q&A pairs).
- valid.jsonl – a validation set for monitoring training (e.g., 10% of the pairs).
- test.jsonl – an optional test set for final evaluation (e.g., 10% of the pairs).

The mlx_lm fine-tuner expects these files to be present in a directory passed via --data. Each file will contain JSONL entries as described above. We should ensure that each line is a valid JSON object and that lines are separated by newline characters without commas (since JSONL is not a single JSON array but individual objects per line).
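Before training, it's worth re-reading each file to confirm every line parses and has the expected role structure. A minimal sketch (the validate_jsonl helper is our own):

import json

def validate_jsonl(path):
    """Check that every line is valid JSON with a system/user/assistant message list."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            entry = json.loads(line)  # raises ValueError on malformed JSON
            roles = [m["role"] for m in entry["messages"]]
            assert roles == ["system", "user", "assistant"], f"Unexpected roles on line {line_number}: {roles}"
    print(f"{path}: OK")

validate_jsonl("train.jsonl")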
Example of one JSONL line (pretty-printed for readability):
{
  "messages": [
    { "role": "system", "content": "You are a helpful nutritionist." },
    { "role": "user", "content": "How can I increase my protein intake in a vegetarian diet?" },
    { "role": "assistant", "content": "Increasing protein in a vegetarian diet is absolutely doable. Focus on plant-based protein sources like beans, lentils, tofu, and tempeh. For example, lentils and beans are rich in protein and also provide fiber and minerals. Incorporating foods like quinoa, which is a complete protein grain, or nuts and seeds such as almonds and chia seeds can also boost your protein intake. Additionally, consider dairy or dairy alternatives enriched with protein, like Greek yogurt or soy milk. By including a variety of these foods in your meals, you can meet your protein needs without any meat." }
  ]
}
In the actual train.jsonl file, the above would be one line (minified, without extra spaces or line breaks). We will generate such lines for every Q&A pair.

Notice how the assistant's answer is detailed and paragraphed. We can allow newline characters in the content if needed (JSON strings can include \n), but generally keeping the answer as a single paragraph (with proper sentence breaks) is fine.
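For reference, json.dumps handles the escaping for you, so newlines inside an answer won't break the JSONL format:

import json

entry = {"role": "assistant", "content": "First, eat plenty of vegetables.\nSecond, stay hydrated."}
print(json.dumps(entry))
# {"role": "assistant", "content": "First, eat plenty of vegetables.\nSecond, stay hydrated."}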
Now, let's put all the pieces into a single Python script. This script will:
1. Load and extract text from all the books.
2. Clean the extracted text.
3. Chunk the text and assign each chunk a topic.
4. Generate a Q&A pair for each chunk.
5. Shuffle the pairs and split them into train/validation/test sets.
6. Save the three JSONL files.
Complete Python Script:
import fitz       # PyMuPDF, for PDF text extraction
import json
import random
import re
import zipfile    # EPUBs are zip archives of HTML files

# 1. Load and extract text from all books
books = [
    {"path": "Michael Pollan - In Defense of Food_ An Eater's Manifesto-Penguin Press HC, The (2008).pdf", "format": "pdf"},
    {"path": "Robert H. Lustig - Metabolical_ The Lure and the Lies of Processed Food, Nutrition, and Modern Medicine-HarperCollins (2021).epub", "format": "epub"},
    {"path": "Sally Fallon - Nourishing Traditions_ The Cookbook that Challenges Politically Correct Nutrition and the Diet Dictocrats (1999).pdf", "format": "pdf"},
    {"path": "T. Colin Campbell, Howard Jacobson - Whole_ Rethinking the Science of Nutrition-BenBella Books (2014).epub", "format": "epub"}
]

all_text = ""
for book in books:
    path = book["path"]
    fmt = book["format"]
    text = ""
    if fmt == "pdf":
        try:
            doc = fitz.open(path)
            for page in doc:
                text += page.get_text("text")
            doc.close()
        except Exception as e:
            print(f"Error reading {path}: {e}")
    elif fmt == "epub":
        # Simplified EPUB handling: an EPUB is a zip archive of HTML files,
        # so read the HTML documents and strip the tags.
        # (A dedicated parser such as ebooklib gives cleaner results.)
        try:
            with zipfile.ZipFile(path) as epub_zip:
                for name in epub_zip.namelist():
                    if name.endswith((".xhtml", ".html", ".htm")):
                        html = epub_zip.read(name).decode("utf-8", errors="ignore")
                        text += re.sub(r"<[^>]+>", " ", html)
        except Exception as e:
            print(f"Note: Could not read EPUB {path}: {e}")
    # Append to the cumulative text with a separator for safety
    all_text += "\n" + text
# 2. Clean the text
def clean_text(raw):
    cleaned = raw.replace('\r', ' ').replace('\n', ' ')
    # Collapse runs of whitespace into single spaces
    cleaned = ' '.join(cleaned.split())
    return cleaned

all_text = clean_text(all_text)
# 3. Chunk text by topic
topics_keywords = {
    "Nutrition Basics": ["nutrition", "nutrient", "vitamin", "mineral", "protein", "carbohydrate", "fat", "fiber"],
    "Healthy Diets": ["diet", "dietary", "Mediterranean", "keto", "low-carb", "vegetarian", "healthy eating", "whole food"],
    "Recipes": ["Ingredients:", "Directions:", "Recipe", "serves", "preheat oven", "mixing bowl", "minced", "chopped"],
    "Good Food Habits": ["habit", "lifestyle", "eat slowly", "regularly", "avoid", "moderation", "snacking", "meal plan"]
}

chunk_size = 1000  # target chunk size in characters (adjust as needed)
chunks = []
# Naively split the text into rough chunks of chunk_size characters,
# trying not to cut in the middle of a sentence.
start = 0
while start < len(all_text):
    chunk = all_text[start:start + chunk_size]
    # Find the last period in the chunk and, if it isn't too early,
    # cut the chunk there (end of sentence).
    last_period = chunk.rfind('.')
    if last_period != -1 and last_period > len(chunk) * 0.5:
        chunk = chunk[:last_period + 1]
    start += len(chunk)  # the next chunk starts right after this one
    chunks.append(chunk)
# Assign each chunk a topic based on keyword occurrences
chunk_topic_pairs = []
for chunk in chunks:
    text_lower = chunk.lower()
    best_topic = None
    best_count = 0
    for topic, keywords in topics_keywords.items():
        count = sum(text_lower.count(word.lower()) for word in keywords)
        if count > best_count:
            best_count = count
            best_topic = topic
    if best_topic is None:
        best_topic = "Nutrition Basics"  # default fallback
    chunk_topic_pairs.append((best_topic, chunk))
# 4. Generate Q&A for each chunk
qa_data = []
for topic, chunk in chunk_topic_pairs:
    # form a question based on the topic
    if topic == "Nutrition Basics":
        question = "Can you explain some fundamentals of nutrition?"
    elif topic == "Healthy Diets":
        question = "What are some principles of a healthy diet?"
    elif topic == "Recipes":
        # Try to use a hint from the chunk for recipe name or main ingredient
        if "soup" in chunk.lower():
            question = "Can you share a healthy soup recipe?"
        elif "salad" in chunk.lower():
            question = "How do I make a nutritious salad?"
        else:
            question = "Could you give me a healthy recipe example?"
    elif topic == "Good Food Habits":
        question = "What are some good food habits I should follow?"
    else:
        question = "I have a nutrition question, can you help?"  # fallback, should not usually hit
    # form an answer based on the chunk content (simplified: use the chunk text itself or a summary)
    answer = chunk.strip()
    # Ensure answer is not too long or empty
    if len(answer) > 1000:
        answer = answer[:1000] + "..."
    if answer == "":
        answer = "I'm sorry, I don't have information on that."
    # Construct the message structure
    qa_entry = {
        "messages": [
            {"role": "system", "content": "You are a helpful nutritionist."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer}
        ]
    }
    qa_data.append(qa_entry)
# 5. Shuffle and split into train/val/test
random.shuffle(qa_data)
N = len(qa_data)
train_split = int(0.8 * N)
val_split = int(0.9 * N)
train_data = qa_data[:train_split]
val_data = qa_data[train_split:val_split]
test_data = qa_data[val_split:]

# 6. Save to JSONL files
def save_jsonl(filename, data):
    with open(filename, 'w', encoding='utf-8') as f:
        for entry in data:
            json_line = json.dumps(entry, ensure_ascii=False)
            f.write(json_line + "\n")

save_jsonl("train.jsonl", train_data)
save_jsonl("valid.jsonl", val_data)
save_jsonl("test.jsonl", test_data)
print(f"Saved {len(train_data)} training samples, {len(val_data)} validation samples, {len(test_data)} test samples.")
Let's break down a few key parts of this script:
- Topic classification: the topics_keywords dictionary helps classify chunks. This is a simple heuristic: count keyword occurrences to decide the topic. (In a more refined approach, one might actually parse the books' structure or use an LLM to label chunks, but keywords suffice here.)
- Chunking: we use a target chunk_size (1000 characters) to cut the text, with a check to avoid cutting off mid-sentence by looking for the last period in each chunk.
- Output: the Q&A pairs are shuffled and split into train.jsonl, valid.jsonl, and test.jsonl. Each line in these files is a compact JSON string as required.

After running this script, we'll have our dataset files ready for fine-tuning. For example, the first few lines of train.jsonl might look like (formatted here for clarity):
{"messages": [{"role": "system", "content": "You are a helpful nutritionist."},
{"role": "user", "content": "How do I make a nutritious salad?"},
{"role": "assistant", "content": "To make a nutritious salad, start with a base of fresh greens such as spinach or kale... (etc)"}]}
{"messages": [{"role": "system", "content": "You are a helpful nutritionist."},
{"role": "user", "content": "What are some good food habits I should follow?"},
{"role": "assistant", "content": "Developing good food habits can significantly improve your health. First, try to eat meals at regular times each day to stabilize your blood sugar... (etc)"}]}
{"messages": [{"role": "system", "content": "You are a helpful nutritionist."},
{"role": "user", "content": "Can you explain some fundamentals of nutrition?"},
{"role": "assistant", "content": "Fundamentals of nutrition include understanding the main nutrients your body needs: carbohydrates, proteins, and fats for energy and body functions, along with vitamins and minerals that support various processes... (etc)"}]}
Each line is a self-contained conversation. Notice that the system message is repeated every time; that's intentional so that each training example is independently understandable to the model with the correct context.
It's crucial that we have separate training and validation sets. The training set is what the model learns from, while the validation set is used to monitor performance and avoid overfitting (the model should not see validation examples during training). We also set aside a test set to evaluate the model after fine-tuning, to see how well it generalizes to unseen questions.
Our script split the data in an 80/10/10 ratio by default. If the total number of Q&A pairs is small (say a few dozen), you might opt for 90/10 (train/valid only) and do a manual evaluation instead of a formal test set. However, given we have content from multiple books, we could easily generate dozens or hundreds of Q&A pairs, so a test set is useful.
With the dataset prepared, we are ready to move on to fine-tuning. In Part 2, we'll cover how to use mlx_lm (an open-source LLM fine-tuning tool) to fine-tune a model like Llama-2 (for example) using these JSONL files. We'll walk through selecting a base model, configuring the fine-tuning run (using LoRA to finetune efficiently), and then training and evaluating our new Nutrition Expert LLM. We'll also demonstrate using the mlx_lm commands (like mlx_lm.lora --model ... --data ... --train) to kick off the fine-tuning, and how to use the resulting model.
By the end of this series, you should have a clear blueprint for turning any collection of texts (not just nutrition) into a fine-tuning dataset and adapting an LLM to become a domain-specific expert. Stay tuned for Part 2, where we bring our nutrition dataset to life through model fine-tuning!