With a little help from my splats

In 2015, Tania and I were part of the team at Stink Studios which developed the website and VR app "Inside Abbey Road" for Google Creative Lab. This was an experience (sadly no longer available due to expired content rights) which offered virtual guided tours around Abbey Road Studios in London as well as a mode where users could wander around freely.

This was a huge project in every sense, but the technical challenges were particularly significant, and rewarding!

We wanted the experience to have the familiarity and ease of Google Street View, but richer in every dimension. For example:

  • Super high quality panoramic imagery (high resolution, HDR, impeccably retouched)
  • Video loops seamlessly integrated into the panoramas
  • Spatial background audio (recorded by Abbey Road Studios engineers, naturally)
  • Tons of embedded content, from archive photography to fully interactive 3D recreations of audio equipment

It wasn't practical to build this on top of Google Street View, so we had the pleasure (and challenge) of building it from the ground up.

Recording the recording studio

After a couple of visits for test captures and lots of prototyping, the core approach we settled on was to pair the high quality panoramic photography with a low-polygon 3D model of each studio.

These 3D models wouldn't be directly visible - the studios contained so many detailed, highly textured objects that recreating them at decent fidelity would have been prohibitively expensive, both in production time and in user bandwidth.

Instead, these models would be used for ray-casting ("is the cursor pointing at the floor or something else?") and for transitions between panoramas.
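The hit-testing half of that is essentially what a Three.js Raycaster gives you against an invisible mesh. Here's a minimal sketch of the idea - the mesh, camera and the convention of naming floor geometry "floor" are placeholders rather than details of the original project:

```ts
import * as THREE from 'three';

const raycaster = new THREE.Raycaster();

// Is the cursor pointing at the floor, something else, or nothing at all?
// `pointer` is the cursor position in normalised device coordinates (-1..1).
function cursorTarget(
  pointer: THREE.Vector2,
  camera: THREE.Camera,
  studioMesh: THREE.Object3D
): 'floor' | 'other' | 'none' {
  raycaster.setFromCamera(pointer, camera);
  const hits = raycaster.intersectObject(studioMesh, true);
  if (hits.length === 0) return 'none';
  // Hypothetical convention: floor faces live in a child object named "floor".
  return hits[0].object.name === 'floor' ? 'floor' : 'other';
}
```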

We used LIDAR scanners to create dense 3D point clouds of each studio.

There are algorithmic methods like Poisson Surface Reconstruction for creating meshes from point clouds, but these tend to produce messy geometry in which it's hard to preserve the key details needed in a low-poly mesh. Instead, we used the 3D point cloud as a geometry reference for a conventional modelling workflow.

The photographic rig used for the panorama capture also included its own LIDAR scanner, which gave us a 2D point cloud that we later aligned with the 3D cloud in order to position each panorama accurately in 3D space.

Some environments were tricky both for photography and lasers!

The 2015 approach

The key trick to make the low-poly models useful for transitions was projective texturing.

Usually the faces of 3D models have texture (UV) coordinates used to map a texture to the surface.

We instead projected each panorama onto the low-poly mesh from the position where it was taken.

If you're looking from the exact position of a panorama and it's the only one being projected, everything looks perfect; but if you blend in another panorama from a different position, or move the camera, things start to look skewed.

These limitations were ok for our purposes. In the context of a transition from one panorama to another, we'd always start and finish with a perfect projection mapping, and we could blend multiple projection sources along the animation path to minimise distortion.
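In shader terms, projective texturing here amounts to computing, for each fragment, the direction from the capture position to that point on the mesh and sampling the equirectangular panorama in that direction, instead of using baked UVs. A rough sketch with a Three.js ShaderMaterial - the uniform names, file path and axis conventions are illustrative, not the original production code:

```ts
import * as THREE from 'three';

// Hypothetical panorama texture; in the real project this was the retouched HDR imagery.
const panoramaTexture = new THREE.TextureLoader().load('panorama.jpg');

const projectiveMaterial = new THREE.ShaderMaterial({
  uniforms: {
    panorama: { value: panoramaTexture },
    // World-space position the panorama was captured from.
    projectorPosition: { value: new THREE.Vector3(0, 1.6, 0) },
  },
  vertexShader: /* glsl */ `
    varying vec3 vWorldPosition;
    void main() {
      vWorldPosition = (modelMatrix * vec4(position, 1.0)).xyz;
      gl_Position = projectionMatrix * modelViewMatrix * vec4(position, 1.0);
    }
  `,
  fragmentShader: /* glsl */ `
    uniform sampler2D panorama;
    uniform vec3 projectorPosition;
    varying vec3 vWorldPosition;
    const float PI = 3.141592653589793;
    void main() {
      // Direction from the capture point to this point on the low-poly mesh...
      vec3 dir = normalize(vWorldPosition - projectorPosition);
      // ...converted to equirectangular (longitude / latitude) texture coordinates.
      float u = atan(dir.z, dir.x) / (2.0 * PI) + 0.5;
      float v = asin(clamp(dir.y, -1.0, 1.0)) / PI + 0.5;
      gl_FragColor = texture2D(panorama, vec2(u, v));
    }
  `,
});
```

A transition then only needs a second panorama and projector position as extra uniforms, with a mix between the two samples animated along the camera path.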

Credit: Stink Studios

The 2026 approach

A 2020 paper introduced NeRFs (Neural Radiance Fields), a way of training a neural network to represent a 3D scene using only a set of 2D images, allowing it to generate new images from novel positions. For example, if your training set comprised photos taken from the front, left and right of an object, the NeRF could render a pretty good approximation of the view from behind it.

This is very impressive, but training a NeRF takes a long time, and rendering one is vastly more expensive than typical polygon rendering, so it's not really suitable for a web app intended to be broadly accessible.

More recently, 3D Gaussian Splatting (3DGS) has emerged as an alternative to NeRFs which is far more efficient to render, allowing real-time framerates even on pretty low-end devices. There are now easy integrations for all the major game engines and web-based 3D frameworks.

I've long been curious about using 3DGS to improve on our old projective texturing approach, and I've finally got around to trying this out.

I used Nerfstudio for training. The dependencies are fiddly - I ended up installing four different versions of the CUDA toolkit while trying to get everything to build, and finally had to manually edit C header files to get around interdependent libraries expecting different versions. This was just as much fun as the last time I tried using something relying on CUDA on Debian.

However, once up and running, the Nerfstudio helper script for custom data works smoothly. It uses ffmpeg to extract planar projections from the panoramic images, then COLMAP for the heavy lifting.
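For reference, the flow looks roughly like this in recent Nerfstudio releases - I'm quoting the commands from memory, so exact flags and method names may differ between versions:

```sh
# Slice the equirectangular panoramas into planar images (ffmpeg)
# and recover camera poses with COLMAP.
ns-process-data images --camera-type equirectangular --images-per-equirect 8 \
  --data ./panoramas --output-dir ./processed

# Train a 3D Gaussian Splatting model (Nerfstudio's "splatfacto" method).
ns-train splatfacto --data ./processed

# Export the trained splats as a PLY file.
ns-export gaussian-splat --load-config outputs/processed/splatfacto/<run>/config.yml \
  --output-dir ./exports
```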

Aside: Our 3D point clouds also included colour data (the scanner had an integrated camera), so in theory we could skip the 3D reconstruction with COLMAP and train directly from the point cloud. Unfortunately, in our case the studios were set up differently on the days we did the LIDAR scans than on the days we did the panoramic photography, so the two wouldn't have matched.

After about 10 minutes of processing we ended up with a 604 MB PLY file, which I converted to a 35 MB SOG file using SplatTransform.

I used Spark to load and render the model into a Three.js scene I'd prepared for a quick prototype.
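The Spark side of the prototype is only a few lines. Roughly - and I'm paraphrasing the Spark getting-started example from memory, so treat the exact API as approximate:

```ts
import * as THREE from 'three';
import { SplatMesh } from '@sparkjs/spark';

// Minimal scene setup; the file name and camera placement are placeholders.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 100);
camera.position.set(0, 1.6, 0);

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(innerWidth, innerHeight);
document.body.appendChild(renderer.domElement);

// Load the converted splat file and add it to the scene like any other mesh.
const studio = new SplatMesh({ url: 'studio.sog' });
scene.add(studio);

renderer.setAnimationLoop(() => renderer.render(scene, camera));
```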

I didn't spend any time tweaking the training parameters, so I'm sure the result could be improved, but it's pretty great already! For the purposes of a transition animation it's a better result than the old projective texturing - every frame is spatially consistent.

It's fantastic that such fine 3D structures as the microphone stands are preserved, but it's interesting that certain surfaces like the polished tops of the grand pianos and the ventilation grilles on the walls are badly interpreted. It's likely that this can be refined, or tackled with a hybrid rendering approach.

There is a slight spatial misalignment between the Gaussian coordinate space and that of the panoramas, but that's almost certainly an artefact of my prototype. I was more interested in testing the end-to-end workflow than in perfecting the alignment.

The amazing (and humbling...) thing is that it was generated automatically, in a matter of minutes, entirely from photography. It seems very likely that if we made something like "Inside Abbey Road" today, we wouldn't need the LIDAR pipeline at all.

I believe that even if we're not quite there yet, it won't be long before we won't need the panoramic capture either - a sufficient number of photos fed into such a system will be able to produce just as high-fidelity a representation of the physical space.

Hats off to the awesome research which has made this possible, and the already-rich ecosystem of Open Source software around it!