High-level description of the preprocessing pipeline
Processing a large stack of mouse brain images into a format that neuroscientists can use requires a great deal of computation. The pipeline is written in Python, which has a large user community and a rich set of libraries.
Our process makes extensive use of the following Python libraries:

- numpy: takes the image data as arrays and is very efficient at performing many tasks on arrays of data.
- opencv: provides many methods for image manipulation.
- pillow: another library for image manipulation.
- scikit-image: another library for image manipulation.
- SimpleITK: a library for image registration, analysis, segmentation, and more.
- Igneous labs cloud volume: the library that processes the aligned images into Neuroglancer data.
For a full listing of all the Python libraries, see the requirements.txt file.
The entire process is run from one script:
`src/pipeline/scripts/create_pipeline.py`
For instructions on running the pipeline, see this HOWTO
Raw image processing
The raw images that come off the scanning microscope are in a proprietary format called CZI. These files are compressed and contain the images (around 20 per CZI file) as well as a great deal of metadata describing both the scanner and the images. This metadata is parsed from the CZI files with the aicspylibczi tools and inserted into the MySQL database with sqlalchemy, an ORM (object-relational mapping) library.
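A minimal sketch of this step, assuming a hypothetical `ScanRun` model; the real pipeline has its own schema, and the table, columns, file name, and connection string below are placeholders:

```python
from pathlib import Path

from aicspylibczi import CziFile
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ScanRun(Base):                       # hypothetical ORM model
    __tablename__ = "scan_run"
    id = Column(Integer, primary_key=True)
    file_name = Column(String(200))
    scene_count = Column(Integer)

czi = CziFile(Path("DKXX_slide_001.czi"))  # illustrative file name
scene_count = len(czi.get_dims_shape())    # roughly one entry per scene

engine = create_engine("mysql+pymysql://user:password@localhost/atlas")  # placeholder DSN
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(ScanRun(file_name="DKXX_slide_001.czi", scene_count=scene_count))
    session.commit()
```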
After the TIF data has been extracted and the metadata inserted into the database, the user can verify the quality of the images and then proceed with the pipeline.
The next steps involve creating histograms for each file and downsampled versions of the images that can be viewed in a web browser.
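A minimal sketch of these per-file steps, computing an intensity histogram with numpy and writing a downsampled preview with Pillow; the file names and the 1/32 scale factor are illustrative, not the pipeline's actual values:

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("DKXX_section_001.tif"))   # hypothetical file name
counts, bins = np.histogram(img, bins=256)           # histogram data for QC plots

scale = 32                                           # assumed downsample factor
thumb = Image.fromarray(img).resize(
    (img.shape[1] // scale, img.shape[0] // scale), Image.LANCZOS
)
thumb.save("DKXX_section_001_thumbnail.png")
```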
Masking and cleaning
Masking is used to remove the debris, glue, and junk found in the clear areas outside the tissue sections. Clean images are important: they look better, and they make the alignment process more reliable.
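An illustrative OpenCV masking pass, not the pipeline's exact algorithm: threshold the image, keep the largest connected contour on the assumption that it is the tissue section, and zero out everything outside it.

```python
import cv2
import numpy as np

img = cv2.imread("DKXX_section_001.tif", cv2.IMREAD_GRAYSCALE)  # illustrative file name
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

mask = np.zeros_like(img)
if contours:
    largest = max(contours, key=cv2.contourArea)  # assume the section is the biggest blob
    cv2.drawContours(mask, [largest], -1, 255, cv2.FILLED)

cleaned = cv2.bitwise_and(img, img, mask=mask)    # debris outside the section is removed
cv2.imwrite("DKXX_section_001_cleaned.tif", cleaned)
```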
Section to section alignment
After the images have been cleaned, they are ready for alignment.
We use a tool called Elastix. It performs a correlation between each pair of adjoining images and returns a rotation, x-shift, and y-shift for each consecutive pair. This transformation is also stored in the database, in the elastix_transformation table.
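One way to drive Elastix from Python is the itk-elastix package; this hedged sketch registers two adjoining downsampled sections with a rigid (rotation plus translation) parameter map. The package choice and file names are assumptions, not necessarily what the pipeline uses:

```python
import itk

fixed = itk.imread("DKXX_section_010_down.tif", itk.F)   # illustrative file names
moving = itk.imread("DKXX_section_011_down.tif", itk.F)

params = itk.ParameterObject.New()
params.AddParameterMap(params.GetDefaultParameterMap("rigid"))

registered, transform = itk.elastix_registration_method(
    fixed, moving, parameter_object=params
)
# The resulting transform parameters hold the rotation and x/y shifts of the
# kind that get stored in the elastix_transformation table.
```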
The alignment is performed on the downsampled images, as the full-resolution images would take too long to process. The full-resolution images use the same rotation, but the x and y translations are multiplied by a scaling factor.
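A minimal sketch of applying a stored rigid transform at full resolution: the rotation is reused as-is (assumed here to be stored in radians) while the x/y shifts are scaled by the downsampling factor. The variable values and the factor of 32 are illustrative:

```python
import cv2
import numpy as np

rotation, xshift, yshift = 0.02, 12.5, -3.0   # as read from elastix_transformation
scale = 32.0                                  # assumed downsample factor

img = cv2.imread("DKXX_section_001_full.tif", cv2.IMREAD_UNCHANGED)
h, w = img.shape[:2]
center = (w / 2, h / 2)

# Build a rigid (rotation + translation) matrix; translations are scaled up.
M = cv2.getRotationMatrix2D(center, np.degrees(rotation), 1.0)
M[0, 2] += xshift * scale
M[1, 2] += yshift * scale

aligned = cv2.warpAffine(img, M, (w, h))
cv2.imwrite("DKXX_section_001_aligned.tif", aligned)
```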
Neuroglancer
The aligned images are now ready to be processed into Neuroglancer’s default image type: precomputed
This part of the pipeline makes extensive use of the Igneous labs cloud volume library.
The aligned stack of full-resolution images is about 500-600GB in size. These files need to be broken down into chunks that Neuroglancer can use. The cloud volume library takes the aligned images and breaks them into chunks, which are written into 9 different directories under the data directory. Each directory holds the files for one resolution level in Neuroglancer. The table below lists the available resolutions; the first directory is the original resolution of the images in nanometers (x=325, y=325, z=20000).
| Directory | Size | Number of files |
|---|---|---|
| 325_325_20000/ | 381GB | 1,261,029 |
| 650_650_20000/ | 97GB | 316,820 |
| 1300_1300_20000/ | 25GB | 79,716 |
| 2600_2600_20000/ | 6.2GB | 19,929 |
| 5200_5200_20000/ | 1.6GB | 5,180 |
| 10400_10400_20000/ | 405MB | 1,330 |
| 20800_20800_20000/ | 111MB | 1,330 |
| 41600_41600_20000/ | 29MB | 350 |
| 83200_83200_40000/ | 4.2MB | 60 |
There are two steps to creating the precomputed format:
1. Create the initial chunks of size (64, 64, 1). Neuroglancer serves data from the web server in chunks, and the initial chunk has a z length of 1, which is necessary for the initial creation. However, this chunk size results in far too many files, so the data needs to be transferred by the next step in the process, which creates a better chunk size and produces the pyramid scheme that is best for viewing in a web browser. This data is stored in /net/birdstore/Active_Atlas_Data/data_root/pipeline_data/DKXX/neuroglancer_data/CX_rechunkme (see the sketch after step 2).
2. The second phase creates a set of optimally sized chunks from the directory created in the previous step and places the new pyramid files in /net/birdstore/Active_Atlas_Data/data_root/pipeline_data/DKXX/neuroglancer_data/CX. This data is now ready to be served by the Apache web server. Note that all the chunks (and there can be millions of files) are compressed with gzip, so the Apache web server must be configured to serve compressed files; this is done in one of the configuration files under the Apache configuration directory on the web server.
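A condensed sketch of both phases, assuming the cloud-volume and igneous packages are used roughly as in their public examples. The volume size, dtype, chunk shapes, and worker count are illustrative, and "C1" stands in for the CX channel directory:

```python
from cloudvolume import CloudVolume
from taskqueue import LocalTaskQueue
import igneous.task_creation as tc

base = "file:///net/birdstore/Active_Atlas_Data/data_root/pipeline_data/DKXX/neuroglancer_data"

# Phase 1: create the initial precomputed layer with (64, 64, 1) chunks.
info = CloudVolume.create_new_info(
    num_channels=1,
    layer_type="image",
    data_type="uint16",               # assumed dtype of the TIF stack
    encoding="raw",
    resolution=[325, 325, 20000],     # nanometers, matching the first table row
    voxel_offset=[0, 0, 0],
    chunk_size=[64, 64, 1],
    volume_size=[65000, 36000, 485],  # hypothetical x, y, z extent in voxels
)
vol = CloudVolume(f"{base}/C1_rechunkme", info=info)
vol.commit_info()
# ...then write each aligned section, e.g. vol[:, :, z] = section_array

# Phase 2: transfer into better-shaped chunks, building the viewing pyramid.
tq = LocalTaskQueue(parallel=8)
tasks = tc.create_transfer_tasks(
    f"{base}/C1_rechunkme",
    f"{base}/C1",
    chunk_size=(128, 128, 64),        # illustrative final chunk shape
)
tq.insert(tasks)
tq.execute()
```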