A Friendly Guide to Public Indoor Environment Datasets

Understanding images of indoor home environments is a fundamental task for many computer vision applications. One of the challenges in advancing computer vision is the availability of suitable datasets on which models can be trained. Particularly useful are public indoor datasets: "indoor" meaning interior spaces such as homes, offices, and other buildings, and "public" meaning the datasets are openly released and can be used freely for research purposes.

Generally, training datasets can be split into two broad categories: manually captured datasets and synthetically generated datasets. Manually captured datasets consist of data collected from the real world and then annotated; synthetically generated datasets consist of data created by algorithms and computer graphics to mimic the real world. Depending on their application or research needs, computer vision researchers might be interested in both, and, especially at the outset of a project, they may spend a lot of time searching for high-quality public indoor datasets that can jumpstart their efforts.

In this post, we provide a quick guide to some of the most popular high-quality public datasets for training computer vision systems to understand indoor environments. We'll touch on the key characteristics, strengths, and weaknesses of each. Over time, Datagen plans to contribute our own publicly available Simulated Datasets to the ecosystem. These are some of the datasets we look to as examples:

Manual Datasets 

Replica 

  • Affiliation – Facebook, Georgia Tech, Simon Fraser University
  • Released – June 2019 
  • Description – Dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale.
  • Scenes – 18 
  • Rooms – 35 
  • Frames – The dataset consists of 3D reconstructions rather than fixed images; frames can be rendered from the reconstructions as needed (see the rendering sketch after this list).
  • Platform – custom-built RGB-D capture rig with an IR projector
  • Available Labels –  RGB, depth, semantic instance, and semantic class segmentation 
  • Bottom Line – The Replica dataset is very high quality but small in scale. It is free for non-commercial uses such as research and education.
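
A natural question is how to turn Replica's reconstructions into training frames. Below is a minimal sketch using the habitat-sim renderer (the engine Replica is designed to work with); the scene path is illustrative, and the exact API varies somewhat between habitat-sim versions, so treat the details as assumptions.

```python
# Minimal sketch: rendering one RGB frame from a Replica scene via habitat-sim.
# Assumes habitat-sim is installed; SCENE is a placeholder path, adjust it to
# wherever you extracted the dataset.
import habitat_sim

SCENE = "Replica-Dataset/room_0/habitat/mesh_semantic.ply"  # illustrative path

backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = SCENE

# One color camera attached to a default agent.
rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [480, 640]  # height, width

agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_spec])
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Render an observation from the agent's current pose (an H x W x 4 RGBA array).
frame = sim.get_sensor_observations()["rgb"]
print(frame.shape)
sim.close()
```

Moving the agent and re-rendering in a loop turns the static reconstruction into as many posed frames as you need.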

ScanNet

  • Affiliation – Stanford University, Princeton University, Technical University of Munich
  • Released –  February 2017
  • Description – ScanNet is an RGB-D video dataset containing 2.5 million views in more than 1,500 scans.
  • Scenes – 1,513
  • Rooms – 707
  • Frames – 2.5 Million
  • Platform – Structure Sensor + iPad
  • Available Labels – 3D camera poses, surface reconstructions, and instance-level semantic segmentations.
  • Image Resolution – Depth at 640×480 and color at 1296×968 pixels (see the alignment sketch after this list)
  • Bottom Line – The semantic segmentation method used is limited in accuracy. Researchers can use the dataset freely for non-commercial research and educational purposes only.
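
Because the color and depth streams are captured at different resolutions, RGB-D pipelines built on ScanNet usually resample one stream to match the other. Here is a minimal sketch, assuming the frames have already been exported to ordinary image files (the paths below are placeholders):

```python
# Minimal sketch: pairing a ScanNet color frame with its depth frame.
# Assumes frames were already exported from the raw .sens stream to image
# files; the paths below are placeholders, not the dataset's actual layout.
import numpy as np
from PIL import Image

color = Image.open("scene0000_00/color/000000.jpg")  # 1296 x 968 RGB
depth = Image.open("scene0000_00/depth/000000.png")  # 640 x 480, 16-bit

# Downsample color to the depth resolution so pixels correspond 1:1.
color_small = color.resize(depth.size, Image.BILINEAR)

# ScanNet stores depth as uint16 millimeters; convert to meters.
depth_m = np.asarray(depth, dtype=np.float32) / 1000.0

# Stack into a 4-channel RGB-D array.
rgbd = np.dstack([np.asarray(color_small, dtype=np.float32) / 255.0, depth_m])
print(rgbd.shape)  # (480, 640, 4)
```

Strictly speaking, the color and depth cameras have different intrinsics, so for metric work you would reproject using the calibration data ScanNet provides rather than rely on a plain resize.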


Matterport3D

  • Affiliation – Stanford University, Princeton University, Technical University of Munich
  • Released – September 2017
  • Description – Matterport3D is a large-scale RGB-D dataset containing 10,800 panoramic views constructed from 194,400 RGB-D images of 90 building-scale scenes.
  • Scenes – 90 
  • Rooms – 2,056
  • Frames – 194,400
  • Platform – Matterport camera
  • Available Labels – surface reconstructions, camera poses, and 2D and 3D semantic segmentation.
  • Image Resolution – 1280×1024
  • Bottom Line – The quality of the data is relatively high, but there are some issues with the geometry and lighting. Additionally, the segmentation is done similarly to the method used by ScanNet and has the same limits in terms of accuracy and resolution. Matterport3D is free for non-commercial academic use.

SceneNN

  • Affiliation – University of Tokyo, Singapore University of Technology and Design, Deakin University, George Mason University, The Hong Kong University of Science and Technology
  • Released – 2016
  • Description – An RGB-D scene dataset consisting of more than 100 indoor scenes captured in a variety of places, e.g., offices, dormitories, classrooms, and pantries.
  • Scenes – 100
  • Rooms – 95
  • Platform – Asus Xtion
  • Bottom Line – The quality and photorealism of the dataset are very high, but the scale is limited by the amount of manual effort required. The dataset is free for educational and research purposes.


Synthetic Datasets

Structured3D 

  • Affiliation – ShanghaiTech University
  • Released – August 2019
  • Description – Structured3D is a large-scale photo-realistic dataset containing 3.5K house designs created by professional designers, with a variety of ground-truth 3D structure annotations and photo-realistic 2D renderings.
  • Scenes – 3500
  • Rooms – 21,835
  • Frames – 196,000
  • Scene Design Type – Professional
  • Available Labels – Rich ground-truth 3D structure annotations, i.e., primitives such as junctions, lines, and planes plus their relationships (see the parsing sketch after this list)
  • Image Resolution – 720×1280
  • Bottom Line – A large and very high-quality dataset. Additionally, like the other synthetic datasets, it has perfect ground truth. The dataset can be used freely only for non-commercial research and educational purposes.
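
To make "structure annotations" concrete, here is a small sketch that reads one scene's annotation file and counts its primitives. The file name and JSON keys reflect our understanding of the dataset's published format (junctions, lines, and planes); verify them against the official Structured3D repository before relying on them.

```python
# Minimal sketch: inspecting Structured3D's per-scene structure annotations.
# File name and JSON keys are assumptions based on the dataset's documentation;
# check the official repository for the authoritative schema.
import json

with open("scene_00000/annotation_3d.json") as f:
    ann = json.load(f)

junctions = ann["junctions"]  # 3D corner points
lines = ann["lines"]          # edges connecting pairs of junctions
planes = ann["planes"]        # wall/floor/ceiling plane parameters

print(f"{len(junctions)} junctions, {len(lines)} lines, {len(planes)} planes")
```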

InteriorNet 

  • Affiliation – Imperial College London, Kujiale.
  • Released – 2018
  • Description – An end-to-end pipeline for rendering an RGB-D-inertial benchmark for large-scale interior scene understanding and mapping.
  • Scenes – 10,000
  • Rooms – 1.7 Million
  • Frames – 5 Million 
  • Scene Design Type – Professional
  • Available Labels – RGB, depth, semantic instance, and semantic class segmentation.
  • Image Resolution – 640×480 
  • Bottom Line – Very realistic, but it struggles to capture real spaces with all their imperfections due to use, clutter, and semantic variety. The dataset can be used freely only for non-commercial research and educational purposes.

SceneNet 

  • Affiliation – University of Cambridge
  • Released – November 2015
  • Description – A dataset of annotated 3D scenes from which virtually unlimited ground-truth training data can be generated.
  • Scenes – 57
  • Rooms – 1000
  • Frames – The dataset consists of 3D scene models rather than fixed images; as with Replica above, frames can be generated from them as needed.
  • Scene Design Type – Random/Manual
  • Available Labels – Per-pixel semantic labeling
  • Bottom Line – SceneNet features very detailed segmentation and perfect ground truth. The dataset was released under a Creative Commons license restricted to research purposes.

SceneNet RGB-D 

  • Affiliation – Imperial College London
  • Released – 2016 
  • Description – Expanding upon the previous work of SceneNet to enable large scale photorealistic rendering of indoor scene trajectories. 
  • Scenes – 16,895
  • Rooms – 57
  • Frames – 5 Million
  • Scene Design Type – Random
  • Available Labels – Pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, object detection, optical flow, depth estimation, camera pose estimation, and 3D reconstruction.
  • Image Resolution – 320×240
  • Bottom Line – An expanded and more advanced version of SceneNet. Each layout also has randomized lighting, camera trajectories, and textures, which enables a continuous stream of unseen training examples (see the randomization sketch after this list). Like SceneNet, this dataset was released under a Creative Commons license restricted to research purposes.
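
The randomization idea is easy to sketch: each training example is drawn from a distribution over layouts, lighting, textures, and camera poses, so a training loop effectively never sees the same render twice. The sketch below is purely illustrative; every name in it is ours, and render_frame stands in for whatever renderer you attach.

```python
# Illustrative sketch of the scene randomization that makes synthetic datasets
# like SceneNet RGB-D an effectively unlimited source of training examples.
# All names are hypothetical; replace render_frame with a real renderer.
import random

LAYOUTS = ["bedroom_01", "kitchen_03", "office_07"]  # placeholder layout ids

def sample_render_config(rng: random.Random) -> dict:
    """Draw one randomized scene configuration."""
    return {
        "layout": rng.choice(LAYOUTS),
        "light_intensity": rng.uniform(0.3, 1.5),
        "light_color_temp_k": rng.uniform(2700, 6500),
        "texture_seed": rng.randrange(2**31),
        "camera_height_m": rng.uniform(1.0, 1.8),
    }

def render_frame(config: dict) -> dict:
    # Placeholder: a real pipeline would hand config to a renderer and get
    # back an RGB frame plus pixel-perfect ground truth.
    return config

rng = random.Random(0)
for _ in range(3):
    print(render_frame(sample_render_config(rng)))
```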

SUNCG

  • Affiliation – Princeton University
  • Released – November 2016
  • Description – This dataset is not photorealistic but focuses on occluded surfaces and achieving accurate geometric representation of the objects in the image.
  • Scenes – 45,622
  • Rooms – 404,058
  • Frames – The data is a 3D house simulation. There are no frames per se, rather frames can be generated from the simulation.
  • Scene Design Type – Manual
  • Available Labels – The dataset introduces the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum (see the sketch after this list).
  • Bottom Line – Large scale and great at dealing with occluded surfaces. However, the scenes lack photorealistic appearance and are often semantically overly simplistic. This dataset is free for educational and research purposes. 
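
To make the input/output contract of semantic scene completion concrete: the network consumes a voxelized view of a single depth image and predicts, for every voxel in the view frustum, whether it is occupied and which class it belongs to. The sketch below is a toy PyTorch stand-in with made-up sizes, not the authors' SSCNet architecture.

```python
# Toy sketch of the semantic-scene-completion input/output contract. This is
# NOT the SSCNet architecture from the paper; grid size and channels are made up.
import torch
import torch.nn as nn

NUM_CLASSES = 12  # e.g., 11 semantic classes plus an "empty" class

class ToySSC(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(16, NUM_CLASSES, kernel_size=3, padding=1),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (batch, 1, D, H, W) occupancy grid derived from a depth image.
        # Returns per-voxel class logits; predicting the "empty" class folds
        # occupancy and semantics into a single output.
        return self.net(voxels)

model = ToySSC()
dummy = torch.zeros(1, 1, 32, 32, 32)  # voxelized view frustum, made-up size
print(model(dummy).shape)              # torch.Size([1, 12, 32, 32, 32])
```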

As you can tell, there are numerous options when it comes to datasets of indoor environments, with significant nuances and differences among them. One thing to note is the trade-off between scale and quality, which is worth keeping top of mind when exploring datasets. Synthetic data has a huge advantage when it comes to scale, since manually collecting and annotating data is labor-intensive and very costly to scale. But, historically, synthetic data has struggled to retain the photorealism of manual datasets without sacrificing quality for quantity (a problem that Datagen is committed to solving).

Another key differentiator between the datasets is the type and quality of their annotations: certain annotation types are irrelevant for key applications, and low-quality annotations can significantly impact application performance.

Choosing a dataset can be tough, and you may spend a lot of time researching your options. We hope this resource expedites your search. In future posts, we'll provide similar summaries of public datasets for other applications, such as hands, faces, and bodies.

Read our survey on Synthetic Data: The Key to Production AI in 2022