How to set up your data?¶
This tutorial explains how to set up a custom dataset to train a deep model using the Aloception framework.
Goals
Prepare your data using Aloscene tools for 2D boxes prediction in classification tasks
Use a high-level tool in Alodataset to set up your data in COCO JSON format
Manually set up paths for multiple datasets.
1. Prepare your data¶
Depending on the application, there are different ways of organizing the information. In computer vision applications, it is common to have a set of images to train, validate and/or test a model, as well as a set of annotations describing the important information in each image. Different types of annotations make it possible to develop applications for detection, segmentation, interpretation and verification of image content.
A reasonable directory layout is therefore the one presented below:
dataset
├── train
|   ├── img_train_0.jpg
|   ├── img_train_1.jpg
|   ├── ...
|   └── img_train_l.jpg
├── valid
|   ├── img_val_0.jpg
|   ├── img_val_1.jpg
|   ├── ...
|   └── img_val_m.jpg
├── test (optional)
|   ├── img_test_0.jpg
|   ├── img_test_1.jpg
|   ├── ...
|   └── img_test_n.jpg
└── annotations
    ├── ann_train
    |   ├── ann_img_train_0.txt
    |   ├── ann_img_train_1.txt
    |   ├── ...
    |   └── ann_img_train_l.txt
    └── ann_valid
        ├── ann_img_val_0.txt
        ├── ann_img_val_1.txt
        ├── ...
        └── ann_img_val_m.txt
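With a layout like this one, each image can be paired with its annotation file by name. The following is a minimal sketch using only the Python standard library. The helper name match_images_to_annotations and the ann_<image name>.txt naming rule are assumptions read off the example tree; they are not part of Aloception.

```python
from pathlib import Path
from typing import List, Tuple


def match_images_to_annotations(img_dir: Path, ann_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every .jpg in img_dir with ann_dir/ann_<stem>.txt (hypothetical naming rule)."""
    pairs = []
    for img_path in sorted(img_dir.glob("*.jpg")):
        ann_path = ann_dir / f"ann_{img_path.stem}.txt"
        if ann_path.is_file():  # skip images that have no annotation file
            pairs.append((img_path, ann_path))
    return pairs
```

For the training set of the tree above, the call would be match_images_to_annotations(Path("dataset/train"), Path("dataset/annotations/ann_train")). Matching by name rather than by listing order avoids silently pairing an image with the wrong annotation when a file is missing.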
Hint
Recent database structures implement a single annotation file for each dataset. Examples are found in COCO JSON or Tensorflow Object Detection CSV formatted databases.
Labels settings¶
In many object detection applications, it is necessary to classify each prediction into one of several categories. For this, the Labels module in Aloscene provides one way to interpret the labels:
import torch
from aloscene import Labels

example_labels = [0, 1, 2, 0, 1, 0, 1]
example_label_names = ["person", "cat", "dog"]

labels = Labels(
    torch.tensor(example_labels).to(torch.float32),
    labels_names=example_label_names,
    encoding="id"  # Also, we can use "one-hot"
)
In this example, we take an encoding by ID of three objects:
person label with id = 0
cat label with id = 1
dog label with id = 2
In addition, the first box should contain a “person”, the second one a “cat”, the third one a “dog”, and so on. This corresponds to the order of the input tensor we assigned.
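To make the two encodings concrete, here is a small sketch in plain Python that does not use Aloscene. It only illustrates what "id" and "one-hot" mean for the example labels; the helper ids_to_one_hot is hypothetical, and how Labels stores a one-hot encoding internally is not shown here.

```python
from typing import List

label_names = ["person", "cat", "dog"]
label_ids = [0, 1, 2, 0, 1, 0, 1]


def ids_to_one_hot(ids: List[int], num_classes: int) -> List[List[int]]:
    """Turn each class id into a vector with a single 1 at the id's position."""
    return [[1 if cls == i else 0 for cls in range(num_classes)] for i in ids]


one_hot = ids_to_one_hot(label_ids, len(label_names))
names = [label_names[i] for i in label_ids]

print(names[:3])    # ['person', 'cat', 'dog']
print(one_hot[:3])  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```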
See also
See Labels for more information on its properties and attributes.
2D boxes declaration¶
Aloscene also implements a module to handle boxes. This module is called BoundingBoxes2D:
import torch
from aloscene import BoundingBoxes2D

random_boxes = torch.rand(labels.size(0), 4)

# First option
boxes = BoundingBoxes2D(
    random_boxes,
    boxes_format="xcyc",
    absolute=False,
    labels=labels
)

# Second option
boxes = BoundingBoxes2D(
    random_boxes,
    boxes_format="xcyc",
    absolute=False,
)
boxes.append_labels(labels)
In this example, a set of random boxes was created with normalized values, using (x_center, y_center, width, height) as the coordinate format and the labels defined in the Labels settings section.
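As an illustration of what this format means, the sketch below converts one normalized (x_center, y_center, width, height) box into pixel corner coordinates for a given frame size. It is plain Python arithmetic with a hypothetical helper name, not an Aloscene API.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xcyc_rel_to_xyxy_abs(box: Box, frame_h: int, frame_w: int) -> Box:
    """Convert normalized (x_center, y_center, width, height) to pixel (x_min, y_min, x_max, y_max)."""
    xc, yc, w, h = box
    x_min = (xc - w / 2) * frame_w
    y_min = (yc - h / 2) * frame_h
    x_max = (xc + w / 2) * frame_w
    y_max = (yc + h / 2) * frame_h
    return (x_min, y_min, x_max, y_max)


# A box centered in a 200 x 400 (height x width) frame
print(xcyc_rel_to_xyxy_abs((0.5, 0.5, 0.25, 0.5), frame_h=200, frame_w=400))
# (150.0, 50.0, 250.0, 150.0)
```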
Warning
If labels are provided, labels.size(0) must equal boxes.size(0).
See also
See BoundingBoxes2D for more information.
Frame implementation¶
Since one frame can have multiple boxes, the Frame module has an attribute called boxes2d. For a random image, we could use Frame as follows with the previous boxes:
import torch
from aloscene import Frame

image_size = (300, 300)  # A random image size
frame = Frame(
    torch.rand(3, *image_size),
    names=("C", "H", "W"),
    boxes2d=boxes,
    normalization="01"
)
Hint
There are many ways to interpret the information in an image, but for the purpose of this tutorial, we only use boxes2d. See Frame to read more about the other attributes.
BaseDataset module¶
The Base dataset module is based on the PyTorch Dataset class. It handles a dataset through its root directory, which it stores in a configuration file named aloception_config.json, saved in the $HOME/.aloception folder.
Tip
When the Base dataset module is used, aloception_config.json and the $HOME/.aloception folder are created automatically. The section 3. Make a config file gives more details.
A typical use of the module follows this code outline:
import os
from alodataset import BaseDataset
from aloscene import Frame, BoundingBoxes2D, Labels


class CustomDataset(BaseDataset):

    def __init__(self, name: str, image_folder: str, ann_folder: str, **kwargs):
        super().__init__(name, **kwargs)
        # Both folders are given relative to the dataset root directory
        self.image_folder = os.path.join(self.dataset_dir, image_folder)
        self.ann_folder = os.path.join(self.dataset_dir, ann_folder)
        self.items = self._match_image_ann(self.image_folder, self.ann_folder)

    def _match_image_ann(self, image_folder, ann_folder):
        """TODO: Match each image with its annotations.
        A minimal example, which assumes both folders sort in the same order: """
        return list(zip(sorted(os.listdir(image_folder)), sorted(os.listdir(ann_folder))))

    def _load_image(self, id: int) -> Frame:
        """TODO: Load the image that corresponds to 'id' from 'self.image_folder'.
        Use the self.items attribute! """
        raise NotImplementedError

    def _load_ann(self, id: int) -> BoundingBoxes2D:
        """TODO: Load the annotations that correspond to 'id' from 'self.ann_folder'.
        Use the self.items attribute! """
        raise NotImplementedError

    def getitem(self, idx: int) -> Frame:
        """Load the image and the annotations that correspond to 'idx'."""
        frame = self._load_image(idx)
        boxes2d = self._load_ann(idx)
        frame.append_boxes2d(boxes2d)
        return frame


data = CustomDataset(
    name="my_dataset",
    image_folder="path/image/folder",
    ann_folder="path/annotation/folder",
    transform_fn=lambda d: d,  # Fake transform to give one example
)

for frame in data.stream_loader():
    frame.get_view().render()
Important
There are several key concepts in the BaseDataset class:
We recommend using the self.dataset_dir attribute to get the dataset root folder, and defining all paths relative to it.
All the information required about each element of the dataset must be returned by the getitem() function.
By default, the dataset size is given by len(self.items).
Use stream_loader() and train_loader() to get individual samples or batched samples, respectively.
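The _load_ann step of the outline depends on how the annotation files are written, which this tutorial does not prescribe. As a sketch only, assume a hypothetical text format with one box per line, written as class_id x_center y_center width height with normalized coordinates. Parsing it needs nothing beyond the standard library:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def parse_annotation_text(text: str) -> Tuple[List[int], List[Box]]:
    """Parse lines of the hypothetical form 'class_id x_center y_center width height'."""
    class_ids: List[int] = []
    boxes: List[Box] = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        if len(fields) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(fields)}")
        class_ids.append(int(fields[0]))
        xc, yc, w, h = (float(value) for value in fields[1:])
        boxes.append((xc, yc, w, h))
    return class_ids, boxes


ids, parsed_boxes = parse_annotation_text("0 0.5 0.5 0.25 0.5\n2 0.25 0.75 0.125 0.25\n")
print(ids)           # [0, 2]
print(parsed_boxes)  # [(0.5, 0.5, 0.25, 0.5), (0.25, 0.75, 0.125, 0.25)]
```

The two lists can then be converted to tensors and wrapped in Labels and BoundingBoxes2D in the same way as in the Labels settings and 2D boxes declaration sections.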
If an application must handle several datasets (such as train, valid and test sets), we recommend using the alodataset.SplitMixin module:
import os
from alodataset import Split, SplitMixin


class CustomDatasetSplitMix(CustomDataset, SplitMixin):

    # Mandatory parameter with relative paths
    SPLIT_FOLDERS = {
        Split.VAL: "val2017",
        Split.TRAIN: "train2017",
        Split.TEST: "test2017",
    }

    def __init__(self, name: str, split: Split = Split.TRAIN, **kwargs):
        # Forward the split so that SplitMixin can record it
        # (this assumes SplitMixin accepts a 'split' keyword argument)
        super(CustomDatasetSplitMix, self).__init__(name=name, split=split, **kwargs)
        self.image_folder = os.path.join(self.image_folder, self.get_split_folder())
        self.ann_folder = os.path.join(self.ann_folder, self.get_split_folder())
        # Match again, now that both folders point to the selected split
        self.items = self._match_image_ann(self.image_folder, self.ann_folder)


data = CustomDatasetSplitMix(
    name="my_dataset",
    image_folder="path/image/folder",
    ann_folder="path/annotation/folder",
    split=Split.VAL
)
Note
CustomDatasetSplitMix could also be written as a single class that inherits directly from the BaseDataset and SplitMixin classes.
Hint
The BaseDataset class is based on the torch.utils.data.Dataset class. More information is provided in Base dataset.
2. Set up a custom dataset based on COCO JSON¶
Many COCO JSON formatted datasets are available on the Internet. For example, roboflow provides several labeled and configured datasets for use in machine learning applications. There are also many examples of how to set up a dataset in COCO JSON format: the Create COCO Annotations From Scratch page explains how to do this manually, and some tools, like roboflow annotations, can create and organize the annotations.
For object detection applications in PyTorch, Aloception offers a module that quickly configures this type of dataset: Coco detection.
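For readers who have not seen the format, the sketch below builds a tiny hand-written COCO-style document and groups its boxes by image, using only the json module. The file names and numbers are made up for illustration. The three top-level keys and the [x_min, y_min, width, height] pixel layout of bbox follow the COCO detection format.

```python
import json
from collections import defaultdict

# A tiny COCO-style document; all values are invented for this example.
coco_json = """
{
  "images": [
    {"id": 1, "file_name": "img_val_0.jpg", "width": 640, "height": 480},
    {"id": 2, "file_name": "img_val_1.jpg", "width": 640, "height": 480}
  ],
  "categories": [
    {"id": 1, "name": "person"},
    {"id": 2, "name": "cat"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 120, 50, 80], "area": 4000, "iscrowd": 0},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [300, 200, 40, 40], "area": 1600, "iscrowd": 0},
    {"id": 12, "image_id": 2, "category_id": 2, "bbox": [10, 20, 30, 60], "area": 1800, "iscrowd": 0}
  ]
}
"""

coco = json.loads(coco_json)
category_names = {cat["id"]: cat["name"] for cat in coco["categories"]}
file_names = {img["id"]: img["file_name"] for img in coco["images"]}

# Group (category name, [x_min, y_min, width, height]) pairs by image file name.
boxes_per_image = defaultdict(list)
for ann in coco["annotations"]:
    boxes_per_image[file_names[ann["image_id"]]].append(
        (category_names[ann["category_id"]], ann["bbox"])
    )

print(dict(boxes_per_image))
```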
Note
For this part of the tutorial, the COCO 2017 detection dataset is used as the custom dataset, and we assume that it was downloaded and stored in the $HOME/data/coco directory. Any other dataset in COCO JSON format can be used by changing the img_folder and ann_file paths.
For the COCO 2017 detection dataset, the validation set can be loaded with:
from alodataset import CocoBaseDataset

coco_dataset = CocoBaseDataset(
    name = "coco",  # Parameter by default, change for other datasets
    img_folder = "val2017",
    ann_file = "annotations/instances_val2017.json",
    mode = "valid"
)

for frame in coco_dataset.stream_loader():
    frame.get_view().render()
The module can now read and process the images of the COCO 2017 detection validation set.
If a quick setup is required, the sample attribute can be used without having to download the dataset:
coco_dataset = CocoBaseDataset(sample = True)
Warning
The sample feature only applies to datasets managed by Alodataset.
3. Make a config file¶
All modules based on BaseDataset prompt the user when they are instantiated if the root directory of the dataset is not registered in the aloception_config.json file. When several datasets are used, it can be more convenient to configure this file by hand.
First, create the aloception_config.json file in the $HOME/.aloception directory. This file must contain the following information:
{
    "dataset_name_1": "path/of/dataset_1",
    "dataset_name_2": "path/of/dataset_2",
    "...": "...",
    "dataset_name_n": "path/of/dataset_n"
}
An example could be:
{
    "coco": "$HOME/data/coco",
    "rabbits": "/data/rabbits",
    "pascal2012": "/data/pascal",
    "raccoon": "$HOME/data/raccon"
}
With this configuration in place, the dataset directory is read automatically according to the name value.
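The file can be written by hand, or generated with a few lines of Python. The sketch below is one possible way to do it with the standard library. The function register_datasets is a hypothetical helper, not part of Alodataset, and it only assumes the flat name-to-path JSON structure shown in this section.

```python
import json
from pathlib import Path
from typing import Dict


def register_datasets(config_dir: Path, datasets: Dict[str, str]) -> Path:
    """Create or update aloception_config.json in config_dir with name -> root directory entries."""
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "aloception_config.json"
    config = {}
    if config_path.is_file():
        config = json.loads(config_path.read_text())  # keep entries that already exist
    config.update(datasets)
    config_path.write_text(json.dumps(config, indent=4))
    return config_path
```

A call such as register_datasets(Path.home() / ".aloception", {"coco": str(Path.home() / "data" / "coco")}) would register the COCO directory used earlier. The sketch writes fully expanded paths because this tutorial does not state whether variables such as $HOME are expanded when the file is read.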
What is next?
Learn how to train a model using your custom data in the Training Detr and Training Deformable DETR tutorials.