How to Build a LoRA Training Dataset from Scratch (2026)
The quality of your LoRA comes almost entirely from the quality of your dataset. A mediocre trainer running a well-built dataset will produce better results than a perfectly configured Kohya run on noisy, inconsistently tagged images.
This guide covers the full dataset pipeline: image count, selection criteria, resolution, folder structure, tagging strategy, and the specific mistakes that show up in bad LoRA output.
How Many Images Do You Actually Need
The right number depends on the type of LoRA you are training.
Character LoRA: 15 to 30 images is the sweet spot. You want enough variety in pose, expression, and lighting that the model learns the character rather than memorizing specific training images. Beyond 50 images, returns diminish unless the character has many distinct visual states.
Style LoRA: 50 to 150 images. Art style is more distributed across images than a character, so you need more examples to capture the full range.
Concept or object LoRA: 10 to 25 images. You are teaching the model what a specific thing looks like, which requires less variety.
You can train a usable LoRA with as few as 10 images. With fewer than that, the model overfits to specific images rather than learning the general concept.
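The ranges above can be turned into a quick sanity check before you start tagging. This is a minimal sketch: the `RECOMMENDED` table simply restates this guide's numbers, and `check_image_count` is a hypothetical helper, not part of any trainer.

```python
# Recommended image counts per LoRA type, restating the ranges in this guide.
RECOMMENDED = {
    "character": (15, 30),
    "style": (50, 150),
    "concept": (10, 25),
}
MIN_IMAGES = 10  # below this, overfitting to individual images is likely


def check_image_count(lora_type: str, count: int) -> str:
    """Return a short verdict for a dataset of `count` images."""
    low, high = RECOMMENDED[lora_type]
    if count < MIN_IMAGES:
        return "too few: expect overfitting"
    if count < low:
        return "usable, but below the recommended range"
    if count > high:
        return "above the recommended range: extra images may add little"
    return "within the recommended range"


print(check_image_count("character", 22))  # within the recommended range
```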
Image Selection Criteria
Consistency of the subject. For a character LoRA, every image should contain that character, ideally filling most of the frame. Cut images where the character is in the background or partially obscured.
Variety of everything else. Poses, angles, expressions, lighting, backgrounds. If all your training images show the character from the front with the same expression, the LoRA will resist generating anything else. Include full-body shots, close-ups, and different poses.
Visual quality. Blurry images and heavy compression artifacts degrade output. Use the sharpest, highest-resolution images you can find.
No text overlays, watermarks, or UI elements. The model will learn these too.
Resolution and Cropping
Resolution requirements by base model:
- SD 1.5: 512x512 pixels. Most trainers also support aspect ratio buckets (512x768, 768x512).
- SDXL and SD 3.5: 1024x1024. Aspect buckets are common.
- Flux: 1024x1024 minimum. Handles non-square resolutions well.
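Aspect-ratio bucketing groups images by shape so they can be trained without forcing every image into a square. The sketch below approximates the idea under stated assumptions: bucket sides are multiples of 64, each bucket's area stays within the square base resolution, and an image goes to the bucket with the closest aspect ratio. The parameter names and defaults are illustrative; Kohya and other trainers have their own settings and exact rules.

```python
def make_buckets(base: int = 1024, step: int = 64,
                 min_side: int = 256, max_side: int = 2048) -> list[tuple[int, int]]:
    """Build (width, height) buckets whose area is at most base * base."""
    max_area = base * base
    buckets = []
    for width in range(min_side, max_side + 1, step):
        # Tallest height that fits the pixel budget, rounded down to a multiple of step.
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.append((width, height))
    return buckets


def pick_bucket(width: int, height: int, buckets: list[tuple[int, int]]) -> tuple[int, int]:
    """Return the bucket whose aspect ratio is closest to the image's."""
    aspect = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))


sdxl_buckets = make_buckets(base=1024)
print(pick_bucket(1024, 1024, sdxl_buckets))  # (1024, 1024)
print(pick_bucket(1000, 1500, sdxl_buckets))  # (832, 1216)
```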
For cropping, the goal is to center the subject and remove irrelevant background without awkward cuts. Birme handles batch resizing in the browser. ImageMagick is faster for large datasets via CLI.
Manual review after cropping is worth the time. Automated crops get some fraction of images wrong.
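A plain center crop is the simplest automated crop, and also the kind that sometimes cuts into the subject, which is why the manual pass matters. This sketch computes the largest centered crop box for a target aspect ratio using only the standard library. The helper is hypothetical; its `(left, upper, right, lower)` tuple is the box format Pillow's `Image.crop()` accepts.

```python
def center_crop_box(width: int, height: int,
                    target_w: int, target_h: int) -> tuple[int, int, int, int]:
    """Largest centered box with the target aspect ratio, as (left, upper, right, lower)."""
    target_aspect = target_w / target_h
    if width / height > target_aspect:
        # Image is wider than the target: trim equal amounts from left and right.
        new_w = round(height * target_aspect)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is taller than (or equal to) the target: trim top and bottom.
    new_h = round(width / target_aspect)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


print(center_crop_box(1920, 1080, 1024, 1024))  # (420, 0, 1500, 1080)
```

After cropping, the image still needs resizing to the exact target size, and a quick visual check for cut-off heads or hands.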
Folder Structure
Kohya_ss and most trainers expect this structure:
dataset/
  10_my_trigger_word/
    image001.png
    image001.txt
    image002.png
    image002.txt
The folder name format is [repeat_count]_[trigger_word]. The repeat count controls how many times each image is seen per epoch. A value of 10 means each image is seen 10 times per epoch.
Each image has a matching .txt file with the same base name containing the caption for that image.
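Before training, it is worth checking that every concept folder follows the `[repeat_count]_[trigger_word]` pattern and that every image has its caption file. A minimal standard-library sketch; the extension list and function names are assumptions for illustration, and each trainer applies its own parsing rules.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of image types


def parse_concept_folder(name: str) -> tuple[int, str]:
    """Split '10_my_trigger_word' into (10, 'my_trigger_word')."""
    repeats, sep, trigger = name.partition("_")  # split on the first underscore only
    if not sep or not repeats.isdigit() or not trigger:
        raise ValueError(f"{name!r} does not match [repeat_count]_[trigger_word]")
    return int(repeats), trigger


def check_dataset(root: str) -> list[str]:
    """Return a list of problems found under root (an empty list means OK)."""
    problems = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        try:
            parse_concept_folder(folder.name)
        except ValueError as exc:
            problems.append(str(exc))
            continue
        for img in sorted(folder.iterdir()):
            # Every image needs a caption with the same base name, e.g. image001.txt.
            if img.suffix.lower() in IMAGE_EXTS and not img.with_suffix(".txt").exists():
                problems.append(f"{folder.name}/{img.name}: missing caption")
    return problems
```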
Trigger Words and Caption Strategy
The trigger word is a unique token prepended to every caption. The LoRA learns to associate it with your subject.
Choose something specific and uncommon. my_char_v1 works better than character. tide_style_lora works better than style. Generic words already exist in the base model's training data, so your LoRA competes with existing associations instead of building a clean new one.
A caption file looks like this:
my_char_v1, 1girl, brown hair, blue eyes, school uniform, smile, standing, outdoors, sunlight
Trigger word first, then descriptive tags.
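If you assemble captions in a script, a small helper can enforce the trigger-first order and drop duplicate or empty tags. A minimal sketch; `build_caption` is a hypothetical name, and the tag list is whatever your tagger or manual pass produced.

```python
def build_caption(trigger: str, tags: list[str]) -> str:
    """Trigger word first, then tags in their original order, without duplicates."""
    ordered = [trigger]
    for tag in tags:
        tag = tag.strip()
        if tag and tag not in ordered:
            ordered.append(tag)
    return ", ".join(ordered)


print(build_caption("my_char_v1", ["1girl", " brown hair", "1girl", "smile"]))
# my_char_v1, 1girl, brown hair, smile
```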
What to tag: Everything you want the LoRA to control, such as hair color, eye color, outfit, expression, and body type.
What to skip: Specific backgrounds, generic lighting, and camera angles. The less you tag these, the more flexible the LoRA is in different contexts.
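One way to apply the tag/skip decision consistently is to keep an explicit skip list and filter every caption's tags through it. The `SKIP_TAGS` set below is a hypothetical example; fill it with the background, lighting, and camera-angle tags you decided to leave out.

```python
# Hypothetical skip list: tags this dataset deliberately leaves untagged.
SKIP_TAGS = {"simple background", "white background", "from side", "from above"}


def drop_skipped(tags: list[str], skip: set[str] = SKIP_TAGS) -> list[str]:
    """Remove skipped tags while preserving the order of the rest."""
    return [t.strip() for t in tags if t.strip() not in skip]


print(drop_skipped(["1girl", "white background", "smile", "from side"]))  # ['1girl', 'smile']
```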
The Danbooru tag wiki is the reference for tag conventions. If you want a fast way to generate a starting set of tags, the free browser tagger I built runs WD14 inference without any local setup.
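WD14-style taggers produce a confidence score per tag, and you keep the tags above a threshold. The sketch assumes the scores are already available as a plain `{tag: score}` dictionary (how you obtain them depends on the tagger). The 0.35 threshold and the underscore-to-space conversion are assumptions to tune for your dataset.

```python
def tags_from_scores(scores: dict[str, float], threshold: float = 0.35) -> list[str]:
    """Keep tags scoring at or above threshold, highest confidence first."""
    kept = [(score, tag) for tag, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    # Danbooru-style tags use underscores; the captions in this guide use spaces.
    return [tag.replace("_", " ") for _, tag in kept]


scores = {"brown_hair": 0.92, "smile": 0.41, "hat": 0.12, "blue_eyes": 0.88}
print(tags_from_scores(scores))  # ['brown hair', 'blue eyes', 'smile']
```

The result is only a starting set: review it against your tag/skip decisions before writing captions.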
Common Mistakes That Ruin Results
Training on too-similar images. Ten images that are slightly cropped versions of the same source will produce a LoRA that only generates that one pose. Variety is not optional.
Inconsistent tagging. If some images tag the hair color and some do not, the model receives inconsistent signals. Decide which attributes you are tagging and apply them to every image.
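A simple way to catch inconsistent tagging is to list the attributes you committed to and flag any caption that has no tag for one of them. The function name and the category sets are hypothetical examples.

```python
def find_missing_attributes(captions: dict[str, str],
                            categories: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each caption file to the attribute categories it fails to tag."""
    report = {}
    for filename, caption in captions.items():
        tags = {t.strip() for t in caption.split(",")}
        missing = [name for name, options in categories.items() if not tags & options]
        if missing:
            report[filename] = missing
    return report


categories = {
    "hair color": {"brown hair", "black hair", "blonde hair"},
    "eye color": {"blue eyes", "brown eyes", "green eyes"},
}
captions = {
    "image001.txt": "my_char_v1, 1girl, brown hair, blue eyes, smile",
    "image002.txt": "my_char_v1, 1girl, blue eyes, standing",
}
print(find_missing_attributes(captions, categories))  # {'image002.txt': ['hair color']}
```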
Over-tagging backgrounds. If every caption includes the specific background scene, the model associates your trigger word with that background. Tag backgrounds only if they are part of what you want the LoRA to produce.
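To spot a background tag that has crept into most captions, count how often each tag appears across the dataset. Tags present in nearly every caption are either intended (attributes you always tag) or a sign of the background association described above. The 0.8 cutoff is an arbitrary assumption, and the output is a list to review by hand, not a list to delete automatically.

```python
from collections import Counter


def frequent_tags(captions: list[str], trigger: str, min_share: float = 0.8) -> list[str]:
    """Tags (other than the trigger) that appear in at least min_share of captions."""
    if not captions:
        return []
    counts = Counter()
    for caption in captions:
        # Use a set so a tag repeated within one caption counts once.
        counts.update({t.strip() for t in caption.split(",") if t.strip()})
    total = len(captions)
    return sorted(t for t, c in counts.items() if t != trigger and c / total >= min_share)


captions = [
    "my_char_v1, 1girl, classroom",
    "my_char_v1, 1girl, classroom, smile",
    "my_char_v1, classroom, standing",
]
print(frequent_tags(captions, "my_char_v1"))  # ['classroom']
```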
Training too long. More epochs are not automatically better. Most character LoRAs reach a good result between 1,500 and 3,000 steps for SD 1.5. Beyond that, the LoRA overfits and loses generalizability.
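Step counts follow from the folder's repeat count: one epoch is roughly images × repeats ÷ batch size steps. This sketch converts a step target into an epoch count. It is an approximation that assumes no gradient accumulation; with aspect buckets, each bucket is batched separately, so real per-epoch counts can come out slightly higher.

```python
import math


def steps_per_epoch(num_images: int, repeats: int, batch_size: int = 1) -> int:
    """Optimizer steps in one epoch: every image is seen `repeats` times."""
    return math.ceil(num_images * repeats / batch_size)


def epochs_for_target(target_steps: int, num_images: int, repeats: int,
                      batch_size: int = 1) -> int:
    """Smallest number of epochs that reaches target_steps."""
    return math.ceil(target_steps / steps_per_epoch(num_images, repeats, batch_size))


# 20 images in a 10_my_char_v1 folder, batch size 2:
print(steps_per_epoch(20, 10, batch_size=2))          # 100
print(epochs_for_target(2000, 20, 10, batch_size=2))  # 20
```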
Skipping weight testing. After training, test the LoRA at a few different weights (0.6, 0.8, 1.0) before deciding the run is complete. The optimal weight is almost never 1.0.
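If you test in AUTOMATIC1111's WebUI, a LoRA is applied in the prompt with the `<lora:filename:weight>` syntax, so the weight sweep can be generated as a list of prompts. Keep the seed and all other settings fixed so only the weight changes between images. The helper and the example prompt are hypothetical; front ends such as ComfyUI set the strength on a LoRA loader node instead.

```python
def weight_sweep_prompts(lora_file: str, trigger: str, prompt: str,
                         weights: tuple[float, ...] = (0.6, 0.8, 1.0)) -> list[str]:
    """One prompt per weight, using the <lora:name:weight> prompt syntax."""
    return [f"{trigger}, {prompt} <lora:{lora_file}:{w}>" for w in weights]


for p in weight_sweep_prompts("my_char_v1", "my_char_v1", "1girl, smile, outdoors"):
    print(p)
```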
Starting Point Recap
- Collect 15-30 images for a character, 50-150 for a style, 10-25 for a concept
- Crop and resize to target resolution for your base model
- Tag each image: unique trigger word + descriptive Danbooru tags
- Organize into the [repeat_count]_[trigger_word] folder structure
- Train with conservative settings, test at multiple weights
The dataset step is slower than training. That is the right trade-off. A better dataset produces better results regardless of training parameters.