The annotations are provided as a single COCO-style JSON file (COCO_SBX_2021_03_19.json) in the dataset folder.
# high level coco json structure
{
"images": [...],
"annotations": [...],
"categories": [...],
"info": {...}
}
The images section has one entry for every RGB and depth frame in the dataset. Some important fields:
# image data
{
'file_name': 'SBXSensor_3_00000000.png',
'height': 480,
'width': 640,
'id': 3,
'scene_id': 0,
'channel': 'rgb'
}
The annotations section includes one entry for every annotated item across all images. Our conventions:
# annotation data
{
"segmentation": { # RLE of segmentation mask
"counts": [....],
"size": [1080, 1920]
},
"area": 96852.0,
"iscrowd": 0,
"image_id": 1,
"bbox": [825.0, 580.0, 241.0, 499.0],
"category_id": 3,
"id": 0,
}
The categories section enumerates the types of annotated objects in the scenes.
# category data
[
{'id': 0, 'name': 'sample_item1', 'supercategory': ''},
{'id': 1, 'name': 'sample_item2', 'supercategory': ''},
...
]
We use the OpenCV pinhole camera model. Positive Z axis points into the scene, positive X axis points to the right and positive Y axis points towards the bottom of the image plane.
Each RGB image will have a unique set of camera intrinsics K.
# example K matrix
array([[1.29900928e+03, 0.00000000e+00, 9.60000000e+02],
[0.00000000e+00, 1.29900928e+03, 5.40000000e+02],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])
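As a minimal sketch of the pinhole model with the example intrinsics above (the 3D point is made up for illustration):
# project a camera-frame point to pixel coordinates with K
import numpy as np

K = np.array([[1.29900928e+03, 0.00000000e+00, 9.60000000e+02],
              [0.00000000e+00, 1.29900928e+03, 5.40000000e+02],
              [0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])

# illustrative point in the camera frame (meters): x right, y down, z into the scene
point_cam = np.array([0.1, -0.05, 1.2])

uvw = K @ point_cam          # apply the intrinsics
u, v = uvw[:2] / uvw[2]      # divide by depth to get pixel coordinates (u, v)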
Every item has an associated camera-relative pose matrix RT that describes the translation and orientation of the item relative to the camera. Translation quantities are in meters (m).
RT = np.matmul(world_to_camera_pose_RT, object_to_world_pose_RT)
# example RT matrix
# RT[:3, :3] is rotation, RT[:, 3] is translation
array([[-0.51877965, 0.81398015, -0.26135031, -0.34774154],
[-0.84177769, -0.53973034, -0.01007326, 0.18114404],
[-0.14925813, 0.21477306, 0.96519145, 1.06569358]])
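For instance, using the example matrix above, a point given in the item's frame moves into the camera frame like this (the point itself is made up):
# apply the 3x4 pose: p_cam = R @ p_obj + t
import numpy as np

RT = np.array([[-0.51877965,  0.81398015, -0.26135031, -0.34774154],
               [-0.84177769, -0.53973034, -0.01007326,  0.18114404],
               [-0.14925813,  0.21477306,  0.96519145,  1.06569358]])

p_obj = np.array([0.05, 0.0, 0.02])            # illustrative item-frame point (m)
p_cam = RT[:3, :3] @ p_obj + RT[:, 3]          # rotate, then translate
print(p_cam)                                    # z > 0: in front of the camera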
The visible percentage measures how occluded an item is in the scene.
We project the item's perspective-transformed bounding box into the image and compute the ratio of visible pixels to total pixels within that projected box. A value of 0.0 means no part of the item is visible; 1.0 means the item is fully visible from the given viewpoint.
In the example below the visible percentage metric would be 0.1515.
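These values are precomputed in the dataset. Purely as a hypothetical illustration of the definition (not the tooling that produced the annotations), the ratio could be computed from two boolean masks like this:
# hypothetical helper for the visible-percentage ratio
import numpy as np

def visible_percentage(visible_mask, bbox_mask):
    """visible_mask: boolean HxW array of pixels where the item is visible.
    bbox_mask:    boolean HxW array of pixels covered by the item's
                  perspective-transformed bounding box."""
    total = np.count_nonzero(bbox_mask)
    if total == 0:
        return 0.0
    return np.count_nonzero(visible_mask & bbox_mask) / total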
Project the 2D bounding box field into the RGB image for an associated annotation.
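A minimal sketch of this step with OpenCV, assuming the JSON and image files sit in the working directory (the annotation index is illustrative):
# draw the COCO bbox ([x, y, width, height] in pixels) on the RGB frame
import json
import cv2

with open('COCO_SBX_2021_03_19.json') as f:
    coco = json.load(f)

ann = coco['annotations'][0]   # illustrative annotation
image_entry = next(im for im in coco['images'] if im['id'] == ann['image_id'])

rgb = cv2.imread(image_entry['file_name'])
x, y, w, h = [int(round(v)) for v in ann['bbox']]
cv2.rectangle(rgb, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('bbox_overlay.png', rgb)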
Load the dataset in python3 using pycocotools, and display the annotation mask for a particular sample.
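One way to do this with pycocotools (the sample image id is illustrative):
# load the COCO annotations and overlay the masks for one image
import matplotlib.pyplot as plt
from pycocotools.coco import COCO

coco = COCO('COCO_SBX_2021_03_19.json')

image_id = 3                                   # illustrative sample id
image_entry = coco.loadImgs(image_id)[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))

plt.imshow(plt.imread(image_entry['file_name']))
coco.showAnns(anns)                            # overlays the segmentation masks
plt.axis('off')
plt.show()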
Iterate through all the images, overlaying the projected masks on each frame and showing the raw masks alongside.
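A sketch of that loop, continuing with pycocotools (display layout is illustrative):
# walk every image: projected mask overlay (left) and raw masks (right)
import matplotlib.pyplot as plt
from pycocotools.coco import COCO

coco = COCO('COCO_SBX_2021_03_19.json')

for image_id in coco.getImgIds():
    image_entry = coco.loadImgs(image_id)[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))

    fig, (ax_overlay, ax_raw) = plt.subplots(1, 2)
    ax_overlay.imshow(plt.imread(image_entry['file_name']))
    plt.sca(ax_overlay)
    coco.showAnns(anns)                     # masks projected onto the frame
    if anns:
        ax_raw.imshow(sum(coco.annToMask(a) for a in anns))  # raw binary masks
    plt.show()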
Load corresponding RGB and Depth frames for a given scene.
The depth images are scaled by a depth_factor that converts the uint16 PNG depth frame into metric distances (m). Corresponding RGB and depth frames share the same scene_id:
# image entries for the RGB and depth frames of the same scene
{'id': 3, 'scene_id': 0, 'channel': 'rgb'}
{'id': 103, 'scene_id': 0, 'channel': 'depth'}
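A sketch of pairing and loading the two frames; where depth_factor is stored is not spelled out above, so it is hard-coded here as a placeholder:
# load the RGB and depth frames of one scene and convert depth to meters
import json
import cv2

with open('COCO_SBX_2021_03_19.json') as f:
    coco = json.load(f)

scene_id = 0                                   # illustrative scene
frames = [im for im in coco['images'] if im['scene_id'] == scene_id]
rgb_entry = next(im for im in frames if im['channel'] == 'rgb')
depth_entry = next(im for im in frames if im['channel'] == 'depth')

rgb = cv2.imread(rgb_entry['file_name'], cv2.IMREAD_COLOR)
depth_raw = cv2.imread(depth_entry['file_name'], cv2.IMREAD_UNCHANGED)  # uint16

depth_factor = 0.001                           # placeholder value, see note above
depth_m = depth_raw.astype('float32') * depth_factor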
Each annotation has a corresponding entry that describes the geometry of the item's 3D bounding box as an 8x3 matrix. The 8 points are the corners of the box in the item's coordinate system. All units are in meters (m).
The bounding box 3D points can be transformed and projected into the camera frame with the intrinsic and relative pose matrix, as shown in the example below.
The example code will plot a red mask over the item with the bounding box corners drawn as yellow circles.
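A hedged reconstruction of such example code, assuming rgb, K, RT, the decoded segmentation mask, and the annotation's 8x3 corner matrix (here called corners) have been obtained as in the snippets above:
# project the 3D bounding box corners and draw them over a red item mask
import numpy as np
import cv2

corners_h = np.hstack([corners, np.ones((8, 1))])        # (8, 4) homogeneous
corners_cam = (RT @ corners_h.T).T                       # (8, 3) camera frame
uvw = (K @ corners_cam.T).T
corners_px = uvw[:, :2] / uvw[:, 2:3]                    # (8, 2) pixel coords

overlay = rgb.copy()
overlay[mask.astype(bool)] = (0, 0, 255)                 # red mask (BGR)
for u, v in corners_px:
    cv2.circle(overlay, (int(u), int(v)), 5, (0, 255, 255), -1)  # yellow corners
cv2.imwrite('bbox3d_overlay.png', overlay)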
For datasets concerned with 6D pose, the meshes for the items are included in the meshes sub-directory.
This example produces a 4-up display for each annotation: the 2D bounding box, the 3D bounding box, and a projected sample of mesh vertices on both the RGB and depth frames.
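A sketch of such a display, again building on the variables from the snippets above; the mesh path, vertex subsampling, and colors are illustrative, and trimesh is just one possible loader:
# 4-up display: 2D bbox, 3D bbox corners, mesh vertices on RGB and on depth
import numpy as np
import matplotlib.pyplot as plt
import trimesh

def project(points_obj, RT, K):
    """Object-frame points (N, 3) -> pixel coordinates (N, 2)."""
    pts_h = np.hstack([points_obj, np.ones((len(points_obj), 1))])
    uvw = (K @ (RT @ pts_h.T)).T
    return uvw[:, :2] / uvw[:, 2:3]

mesh = trimesh.load('meshes/sample_item1.ply')           # illustrative path
mesh_px = project(np.asarray(mesh.vertices)[::50], RT, K)
corners_px = project(corners, RT, K)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

x, y, w, h = ann['bbox']                                  # 2D bounding box
axes[0, 0].imshow(rgb)
axes[0, 0].add_patch(plt.Rectangle((x, y), w, h, fill=False, color='lime'))
axes[0, 0].set_title('2D bbox')

axes[0, 1].imshow(rgb)                                    # 3D bounding box corners
axes[0, 1].scatter(corners_px[:, 0], corners_px[:, 1], c='yellow', s=10)
axes[0, 1].set_title('3D bbox')

for ax, frame, title in [(axes[1, 0], rgb, 'mesh on RGB'),
                         (axes[1, 1], depth_m, 'mesh on depth')]:
    ax.imshow(frame)
    ax.scatter(mesh_px[:, 0], mesh_px[:, 1], c='red', s=1)
    ax.set_title(title)

plt.tight_layout()
plt.show()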