SCALCE

A tool for compressing FASTQ files


Introduction

So what is SCALCE?

SCALCE (/skeɪlz/, a.k.a. boosting Sequence Compression Algorithms using Locally Consistent Encoding) is a tool for compressing FASTQ files. It is designed specifically for Illumina-generated FASTQ files, but supports any valid FASTQ file with consistent read lengths. SCALCE was published in Bioinformatics in October 2012.

How do I get SCALCE?

Just clone our repository and run make:

git clone https://github.com/sfu-compbio/scalce.git
cd scalce
make download
make

If you have trouble compiling, please try our x86_64 binary compiled on CentOS 7:

git clone https://github.com/sfu-compbio/scalce.git binary

If you don’t have git, you can always fetch pre-packaged SCALCE archives:

Note: You will need zlib >= 1.2.6 and the libbzip2 library to compile the sources. Unfortunately, RHEL/CentOS 5.x and older come with antiquated versions of zlib, so we recommend downloading a newer version via make download. pigz is also recommended for multi-threaded mode. See Usage for an explanation.

Note: SCALCE v2.7 and earlier do not support variable read lengths. Starting with v2.8, EXPERIMENTAL (AND VERY BUGGY) support for variable read lengths has been added. To use it, compile with make -j pacbio and run SCALCE through the scalce-pacbio binary. All SCALCE options are also valid for scalce-pacbio. Please note that SCALCE is not designed for very long reads (e.g. PacBio, Nanopore), so the compression performance might not be ideal. Also make sure to double-check that long reads decompress correctly and that the output is valid.

How do I use SCALCE?

SCALCE is invoked as follows:

Compression

scalce [input_1.fastq] -o [result]

will compress input_1.fastq to the files result.00_1.scalcen.gz, result.00_1.scalcer.gz and result.00_1.scalceq.gz.


scalce [input_1.fastq] -r -o [result] -n [library]

will compress input_1.fastq together with its paired end input_2.fastq, discarding the read names and setting the library name to library.

Decompression

scalce [input_1.scalcen] -d -o [something.fastq]

will decompress the SCALCE files input_1.scalce* to something.fastq.
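After a compress/decompress round trip it can be useful to confirm that the restored FASTQ holds the same records as the input. The following is a minimal sketch, not part of SCALCE: the helper names fastq_canon and same_records are hypothetical, and it assumes the plain four-line FASTQ layout. It sorts the records before comparing, so it does not depend on the order in which reads come back (this page does not say whether the order is preserved). The comparison only makes sense for lossless runs that keep the read names, i.e. without -n or -p.

```shell
# fastq_canon FILE
# Prints one FASTQ record per line (the four lines joined by tabs), sorted,
# so that two files can be compared regardless of record order.
fastq_canon() {
    paste - - - - < "$1" | LC_ALL=C sort
}

# same_records A B
# Succeeds only if A and B contain exactly the same set of records.
same_records() {
    a=$(fastq_canon "$1" | cksum)
    b=$(fastq_canon "$2" | cksum)
    [ "$a" = "$b" ]
}

# Hypothetical usage (file names are placeholders; the name passed to -d
# is whichever .scalcen file the compression step produced):
#   scalce reads_1.fastq -o result
#   scalce result.00_1.scalcen.gz -d -o restored.fastq
#   same_records reads_1.fastq restored.fastq && echo "round trip OK"
```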

Usage

Input and output

SCALCE is a FASTQ compression tool designed specifically for Illumina-generated FASTQ files. SCALCE will compress the provided FASTQ files and generate three output files with the following extensions:

  • .scalcen for read names,
  • .scalcer for reads, and
  • .scalceq for quality scores.

The read length must be fixed within each mate in a single run. This means that if you are passing 3 paired-end libraries, all reads in the first mate must have the same length (e.g. 50bp). The read lengths of the second mate may differ from those of the first mate.
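The fixed-length requirement can be checked before compression. The sketch below is an illustration rather than anything SCALCE ships: has_fixed_read_length is a hypothetical helper, and it assumes the plain four-line FASTQ layout in which every second line of a record is the sequence.

```shell
# has_fixed_read_length FILE
# Succeeds only if FILE contains at least one read and every read
# (line 2 of each four-line record) has the same length.
has_fixed_read_length() {
    awk '
        NR % 4 == 2 {
            if (!seen) { len = length($0); seen = 1 }
            else if (length($0) != len) { bad = 1; exit }
        }
        END { exit (bad || !seen) }
    ' "$1"
}

# Hypothetical usage (file name is a placeholder):
#   has_fixed_read_length reads_1.fastq || echo "variable read lengths found"
```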

Mandatory parameters

  • -o, --output

    Specifies the prefix for the output file names. Extensions and basic information will be appended.

    Standard output is supported in decompression mode, unless you use the -S or -r parameters. Use - to indicate standard output (i.e. -o -).

Shared arguments (both for compression and decompression)

  • -r, --paired-end

    Use paired-end FASTQ files when the two ends are in separate files. The files should be named with _1 and _2. Pass only the _1 file as input; SCALCE will replace _1 with _2 and read the second file.

    File name example: XX_1.fq XX_2.fq

  • -n, --skip-names [library]

    Discard the original read names and rename each read with the library prefix, such as library.1, library.2 etc. This option can improve the compression rate substantially.
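The _1/_2 naming convention expected by -r can be verified before a paired-end run. This is a minimal sketch with hypothetical helper names (mate2_name, check_pair). Which occurrence of _1 SCALCE replaces is not stated here, so the sketch assumes the last one.

```shell
# mate2_name FILE
# Prints FILE with its last "_1" replaced by "_2".
# Assumption: replacing the LAST occurrence is a choice made for this sketch.
mate2_name() {
    printf '%s\n' "$1" | sed 's/\(.*\)_1/\1_2/'
}

# check_pair FILE
# Succeeds only if FILE contains "_1" and both mate files exist.
check_pair() {
    mate2=$(mate2_name "$1")
    [ "$mate2" != "$1" ] && [ -f "$1" ] && [ -f "$mate2" ]
}

# Hypothetical usage (file and library names are placeholders):
#   check_pair XX_1.fq && scalce XX_1.fq -r -o result -n library
```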

Utility arguments

  • -h, --help

    Prints short usage information.

  • -v, --version

    Prints the current version of SCALCE.

Decompression arguments

  • -d, --decompress

    Decompress SCALCE files. Provide just one file name (the .scalceq file, for example), and the program will take care of the other files.

  • -S, --split-reads [count]

    Split the output into multiple files, where each file contains the given number of reads.

    Default: 0 (do not split)
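To pick a sensible value for -S it helps to know how many reads a FASTQ file holds and how many pieces a given count would produce. The arithmetic can be sketched as follows. Both helper names are hypothetical, the four-line FASTQ layout is assumed, and the file count is a plain ceiling division that assumes the last file simply holds the remainder.

```shell
# count_reads FILE
# Prints the number of records in FILE, assuming four lines per record
# and a trailing newline on the last line.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# split_file_count READS PER_FILE
# Prints ceil(READS / PER_FILE); prints 1 when PER_FILE is 0 (no splitting).
split_file_count() {
    if [ "$2" -eq 0 ]; then
        echo 1
    else
        echo $(( ($1 + $2 - 1) / $2 ))
    fi
}

# Hypothetical usage (file name is a placeholder):
#   split_file_count "$(count_reads reads_1.fastq)" 1000000
```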

Compression arguments

  • -B, --bucket-set-size [size][MG]

    Set the bucket set size in (M)egabytes or (G)igabytes. This parameter limits the main memory accessible to SCALCE. Swap files will be used to keep all necessary data.

    Default: 4G

  • -c, --compression [mode]

    Select compression mode. Currently available modes are:

    • no - No compression
    • gz - gzip compression level 6
    • pigz - parallelized gzip
    • bz - bzip2 compression

    Default: gz, or pigz if the number of threads is greater than 1

  • -A, --no-arithmetic

    Disable arithmetic coding for the quality compression and use default compression mode. This helps reduce both compression and decompression time, but the compression ratio may suffer.

    Default: not activated

  • -p, --lossy-percentage [percentage]

    Set lossy error percentage.

    Default: 0 (no lossy compression)

  • -s, --sample-size [count]

    Specifies how many quality values are used for statistical analysis when creating the lossy transformation table.

    Default: 100000

  • -t, --temp-directory [directory]

    Set directory for holding temporary files.

    Default: __temp__

  • -T, --threads [num]

    Specify the number of working threads.

    Default: 4 (if the system offers fewer than 4 cores, the number of threads will be adjusted automatically)

Note: To take advantage of multi-threading, the pigz binary must be located within the PATH. Otherwise, you should use SCALCE with the -T1 (single thread) option.
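The pigz requirement can be handled in a wrapper script that falls back to a single thread when pigz is missing. A minimal sketch, with pick_threads as a hypothetical helper name:

```shell
# pick_threads REQUESTED
# Prints REQUESTED if a pigz binary is found on the PATH, otherwise 1.
pick_threads() {
    if command -v pigz >/dev/null 2>&1; then
        printf '%s\n' "$1"
    else
        printf '1\n'
    fi
}

# Hypothetical usage (file names are placeholders):
#   scalce reads_1.fastq -o result -T "$(pick_threads 4)"
```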

Support

Contact & Support

Feel free to drop any inquiry to inumanag at sfu dot oh canada or fhach at sfu dot oh canada.

Authors

SCALCE has been brought to you by:

from the Lab for Computational Biology at Simon Fraser University, Eichler Lab at University of Washington, and Alkan Lab at Bilkent University.

Licence

Copyright (c) 2011–2012, Simon Fraser University. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the Simon Fraser University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Release notes

  • (10-Jan-2016) SCALCE version 2.8 release
    • Bugfixes (arithmetic decoding bugfix)
    • Fixed a decompression bug when number of reads was greater than 2^32. Compression was not affected.
    • New: support for variable length reads via scalce-pacbio.
  • (20-May-2013) SCALCE version 2.7 release
    • Bugfixes (no-arithmetic fix)
  • (13-May-2013) SCALCE version 2.6 release
    • Bugfixes
  • (02-Apr-2013) SCALCE version 2.5 release
    • Read splitting supported
    • Standard output during decompression supported
    • Bugfixes
  • (25-Mar-2013) SCALCE version 2.4 release
    • Auto-pigz detection
    • Bugfixes
  • (10-Sep-2012) SCALCE version 2.3 release
    • Decompression speed improvements
  • (25-Jul-2012) SCALCE version 2.2 release
    • Speed improvements
    • Arithmetic coding for qualities is now optional
    • Multiple bug fixes
  • (06-Jun-2012) SCALCE version 2.1 release
    • Better compression of reads
    • Arithmetic coding for qualities
    • Multiple bug fixes
  • (02-Mar-2012) SCALCE version 1.4 release
    • Fixed a serious data loss bug when using multithreading
  • (20-Feb-2012) SCALCE version 1.3 release
    • Various bug fixes
  • (17-Feb-2012) SCALCE version 1.2 release
    • Various bug fixes
  • (08-Feb-2012) SCALCE version 1.1 release
    • OpenMP support
    • pigz support
  • (06-Dec-2011) SCALCE version 1.0 release
    • Initial release