In this guide we are going to use the Danbooru2021 dataset by Gwern.net. You are free to use any other dataset as long as you know how to convert it to the right format.
We recommend 512x512 images in JPG format. Each text file must have the same name as the image it refers to.
For example:
````
mydataset
├── img
│   └── image001.jpg
└── txt
    └── image001.txt
````
Here, image001.txt contains the tags (the prompt) to be used for image001.jpg.
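Before training, it can help to confirm that every image has a matching text file. The following is a minimal sketch, assuming the `img`/`txt` layout shown above; the function name `find_unpaired` is hypothetical and not part of any tool used in this guide.

```python
from pathlib import Path


def find_unpaired(dataset_root):
    """Return (images_without_text, texts_without_image) as sorted lists of file stems."""
    root = Path(dataset_root)
    # Compare file names without their extensions (image001.jpg -> image001).
    image_stems = {p.stem for p in (root / "img").glob("*.jpg")}
    text_stems = {p.stem for p in (root / "txt").glob("*.txt")}
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)


if __name__ == "__main__":
    missing_txt, missing_img = find_unpaired("mydataset")
    print("images without a tag file:", missing_txt)
    print("tag files without an image:", missing_img)
```

Both lists should be empty for a correctly paired dataset.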
## Downloading the dataset
This step is optional; if you have your own dataset, skip this part.
### Downloading Rsync
Danbooru2021 is available for download through rsync.
#### Linux
On Linux, you should be able to install rsync via your package manager.
````bash
# Debian/Ubuntu; use your distribution's package manager on other systems
sudo apt install rsync
````
#### Windows
On Windows, you need to install Cygwin, a POSIX runtime for Windows that lets you run many Linux-only programs.
During the Cygwin setup, type "rsync" in the search bar of the package selection screen, change "View: Pending" to "View: Full", and select the latest version in the "New" column. Do the same for "zip".
If you want to see the entire file list, you can refer to the [Danbooru2021 information site](https://www.gwern.net/Danbooru2021).
We are going to extract the images from the 512px folder for convenience, since this folder already has the images resized to 512x512 resolution in JPG format. It only has safe-rated images; for NSFW images, refer to [gwern.net](https://www.gwern.net/Danbooru2021#samples).
We will use folders 0000 to 0009.
> The folders are named according to the last 3 digits of the image ID on Danbooru. Images in folder 0001 have IDs ending in 001.
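The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here; the helper names `folder_for_id` and `in_downloaded_subset` are hypothetical.

```python
def folder_for_id(image_id: int) -> str:
    """Return the 4-character folder name for an image ID (its last 3 digits, zero-padded)."""
    return f"{image_id % 1000:04d}"


def in_downloaded_subset(image_id: int) -> bool:
    """True if the image lives in one of the folders 0000 to 0009."""
    return image_id % 1000 < 10
```

For instance, an image with ID 2345001 belongs to folder 0001, so it is part of the subset used in this guide.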
We are also going to download only the first JSON batch. If you want to train on more data, you should download more JSON batches.
Download the 512px folders from 0000 to 0009 (3.86GB):
Although we now have the images, the metadata that describes each image is inside the JSON file. To extract that data into individual txt files, we are going to use the script at ``danbooru_data/local/extractfromjson_danboo21.py``.
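To give an idea of what such an extraction involves, here is a rough sketch. It is not the script mentioned above. It assumes the metadata file holds one JSON object per line, that each object has `id` and `tag_string` fields, that images are named `<id>.jpg`, and that tags should be written comma-separated. All of these are assumptions, so check them against your actual files. The function name `extract_tags` is hypothetical.

```python
import json
from pathlib import Path


def extract_tags(metadata_path, img_dir, txt_dir):
    """Write one <id>.txt per metadata record whose image exists in img_dir.

    Assumed record shape: {"id": ..., "tag_string": "tag_a tag_b ..."}.
    Returns the number of text files written.
    """
    img_dir, txt_dir = Path(img_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(metadata_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            image_id = str(record["id"])
            if not (img_dir / f"{image_id}.jpg").exists():
                continue  # skip records for images that were not downloaded
            tags = record["tag_string"].split()
            (txt_dir / f"{image_id}.txt").write_text(", ".join(tags), encoding="utf-8")
            written += 1
    return written
```

Skipping records without a local image keeps the `txt` folder in sync with the subset of folders that was actually downloaded.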