CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats

Kamil Khanipov, Georgiy Golovko, Mark Rojas, Levent Albayrak, Otto Dobretsberger, Maria Pimenova, Nels Olson, Sergei Chumakov, Yuriy Fofanov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

The storage, manipulation, and especially internet transfer of the large amounts of data produced by High-Throughput Sequencing (HTS) instruments present major obstacles to utilizing the full potential of this promising technology. The current standard is to store all of the data, which are produced in text formats (FASTQ and FASTA) and often stored in binary formats (SRA and BAM). To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets in their existing formats. However, given the substantial improvements in the quality of HTS data, we believe that if one can afford to exclude low-quality data and read headers, new and much more compact data formats can reduce the size of HTS data files by at least two orders of magnitude. Here we present several examples of file formats specifically designed to store only high-quality sequencing reads in space-efficient text and binary form.
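To give a sense of why dropping read headers and quality strings shrinks the data, the sketch below keeps only reads whose bases all meet a quality threshold and packs each retained base into 2 bits. It is an illustration only and not the CoCo formats described in the paper: the function names, the Phred+33 offset, the threshold of 20, and the rule of skipping reads that contain symbols other than A, C, G, T are all assumptions made for this example.

```python
# Illustrative sketch only; this is NOT the CoCo file format.
# Assumptions: Phred+33 quality encoding, a minimum per-base quality of 20,
# and reads containing anything other than A/C/G/T are discarded.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def passes_quality(quality, min_phred=20, offset=33):
    """Return True if every base in the read meets the quality threshold."""
    return all(ord(ch) - offset >= min_phred for ch in quality)


def pack_read(seq):
    """Pack a nucleotide string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_read(packed, length):
    """Recover the nucleotide string from its packed form."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def compact_fastq(lines, min_phred=20):
    """Yield (read_length, packed_bytes) for each high-quality FASTQ record.

    Headers and quality strings are dropped; only the sequence is kept.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        seq, qual = records[i + 1], records[i + 3]
        if all(b in CODES for b in seq) and passes_quality(qual, min_phred):
            yield len(seq), pack_read(seq)
```

With four bases per byte, a 100-base read takes 25 bytes of sequence payload, whereas the same FASTQ record exceeds 200 bytes once the quality string and header are counted.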

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
Editors: Ing. Matthieu Schapranow, Jiayu Zhou, Xiaohua Tony Hu, Bin Ma, Sanguthevar Rajasekaran, Satoru Miyano, Illhoi Yoo, Brian Pierce, Amarda Shehu, Vijay K. Gombar, Brian Chen, Vinay Pai, Jun Huan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1117-1122
Number of pages: 6
ISBN (Electronic): 9781467367981
State: Published - Dec 16 2015
Event: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015 - Washington, United States
Duration: Nov 9 2015 – Nov 12 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015

Other

Other: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
Country/Territory: United States
City: Washington
Period: 11/9/15 – 11/12/15

Keywords

  • File Formats
  • HTS Data
  • HTS File Converter

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Health Informatics
  • Biomedical Engineering
