CoCo

An application to store High-Throughput Sequencing data in compact text and binary file formats

Kamil Khanipov, George Golovko, Mark Rojas, Levent Albayrak, Otto Dobretsberger, Maria Pimenova, Nels Olson, Sergei Chumakov, Yuriy Fofanov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

The storage, manipulation, and especially internet transfer of large amounts of data produced by High-Throughput Sequencing (HTS) instruments present major obstacles to utilizing the full potential of this promising technology. The current standard is based on storing all data, which are produced in text (FASTQ and FASTA) and often stored in binary (SRA and BAM) formats. To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets in their existing formats. However, given the substantial improvements in the quality of HTS data, we believe that if one can afford to exclude low-quality data and read headers, new and much more compact data formats can be used to reduce the size of HTS data files by at least two orders of magnitude. Here we present several examples of file formats specifically designed to store only high-quality sequencing reads in space-efficient text and binary form.
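
The idea in the abstract (drop read headers and low-quality reads, then store the remaining bases compactly) can be illustrated with a minimal Python sketch. This is not CoCo's actual file format, which the abstract does not specify. The 2-bit base code, the 2-byte length prefix, the Phred+33 quality encoding, the default threshold of 20, and all function names are assumptions made for illustration, and the input is assumed to be well-formed 4-line FASTQ.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical 2-bit code for the four unambiguous bases (an assumption,
# not the encoding used by CoCo).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality) pairs from 4-line FASTQ records.

    The header line of each record is read and discarded.
    """
    it = iter(lines)
    for _header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield seq, qual


def is_high_quality(seq: str, qual: str, min_phred: int = 20) -> bool:
    """True if every base is A/C/G/T and every Phred+33 score is >= min_phred."""
    bases_ok = all(base in BASE_TO_BITS for base in seq)
    scores_ok = all(ord(q) - 33 >= min_phred for q in qual)
    return bases_ok and scores_ok


def pack_read(seq: str) -> bytes:
    """Encode a read as a 2-byte big-endian length followed by 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    n_bytes = (2 * len(seq) + 7) // 8
    return len(seq).to_bytes(2, "big") + value.to_bytes(n_bytes, "big")


def unpack_read(blob: bytes) -> str:
    """Inverse of pack_read: recover the base string from the packed bytes."""
    n = int.from_bytes(blob[:2], "big")
    value = int.from_bytes(blob[2:], "big")
    return "".join(
        BITS_TO_BASE[(value >> (2 * (n - 1 - i))) & 3] for i in range(n)
    )


def compact(lines: Iterable[str], min_phred: int = 20) -> List[bytes]:
    """Drop headers and low-quality reads, then pack the surviving reads."""
    return [
        pack_read(seq)
        for seq, qual in parse_fastq(lines)
        if is_high_quality(seq, qual, min_phred)
    ]
```

In this sketch a retained read costs roughly a quarter of a byte per base plus the length prefix, compared with more than two bytes per base in FASTQ once the header and quality lines are counted. The size reduction reported in the abstract refers to the authors' own formats and is not something this sketch demonstrates.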

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1117-1122
Number of pages: 6
ISBN (Print): 9781467367981
DOIs: https://doi.org/10.1109/BIBM.2015.7359838
State: Published - Dec 16 2015
Event: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015 - Washington, United States
Duration: Nov 9 2015 → Nov 12 2015

Other

Other: IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
Country: United States
City: Washington
Period: 11/9/15 → 11/12/15

Fingerprint

Cocos
Information Storage and Retrieval
Quality Improvement
Internet
Throughput
Technology
Data Accuracy
Datasets

Keywords

  • File Formats
  • HTS Data
  • HTS File Converter

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Health Informatics
  • Biomedical Engineering

Cite this

Khanipov, K., Golovko, G., Rojas, M., Albayrak, L., Dobretsberger, O., Pimenova, M., ... Fofanov, Y. (2015). CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats. In Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015 (pp. 1117-1122). [7359838] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIBM.2015.7359838

CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats. / Khanipov, Kamil; Golovko, George; Rojas, Mark; Albayrak, Levent; Dobretsberger, Otto; Pimenova, Maria; Olson, Nels; Chumakov, Sergei; Fofanov, Yuriy.

Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015. Institute of Electrical and Electronics Engineers Inc., 2015. p. 1117-1122 7359838.

Khanipov, K, Golovko, G, Rojas, M, Albayrak, L, Dobretsberger, O, Pimenova, M, Olson, N, Chumakov, S & Fofanov, Y 2015, CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats. in Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015, 7359838, Institute of Electrical and Electronics Engineers Inc., pp. 1117-1122, IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015, Washington, United States, 11/9/15. https://doi.org/10.1109/BIBM.2015.7359838
Khanipov K, Golovko G, Rojas M, Albayrak L, Dobretsberger O, Pimenova M et al. CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats. In Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015. Institute of Electrical and Electronics Engineers Inc. 2015. p. 1117-1122. 7359838 https://doi.org/10.1109/BIBM.2015.7359838
Khanipov, Kamil; Golovko, George; Rojas, Mark; Albayrak, Levent; Dobretsberger, Otto; Pimenova, Maria; Olson, Nels; Chumakov, Sergei; Fofanov, Yuriy. / CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats. Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 1117-1122
@inproceedings{0e1ed099f68a43f9b3986cfc4e86e211,
title = "CoCo: An application to store High-Throughput Sequencing data in compact text and binary file formats",
abstract = "The storage, manipulation, and especially internet transfer of large amounts of data produced by High-Throughput Sequencing (HTS) instruments present major obstacles to utilizing the full potential of this promising technology. The current standard is based on storing all data, which are produced in text (FASTQ and FASTA) and often stored in binary (SRA and BAM) formats. To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets in their existing formats. However, given the substantial improvements in the quality of HTS data, we believe that if one can afford to exclude low quality data and read headers, new much more compressed data formats can be used to reduce the size of HTS data files by at least two orders of magnitude. Here we present several examples of file formats specifically designed to store only high quality sequencing reads in space efficient text and binary form.",
keywords = "File Formats, HTS Data, HTS File Converter",
author = "Kamil Khanipov and George Golovko and Mark Rojas and Levent Albayrak and Otto Dobretsberger and Maria Pimenova and Nels Olson and Sergei Chumakov and Yuriy Fofanov",
year = "2015",
month = "12",
day = "16",
doi = "10.1109/BIBM.2015.7359838",
language = "English (US)",
isbn = "9781467367981",
pages = "1117--1122",
booktitle = "Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN
T1 - CoCo
T2 - An application to store High-Throughput Sequencing data in compact text and binary file formats
AU - Khanipov, Kamil
AU - Golovko, George
AU - Rojas, Mark
AU - Albayrak, Levent
AU - Dobretsberger, Otto
AU - Pimenova, Maria
AU - Olson, Nels
AU - Chumakov, Sergei
AU - Fofanov, Yuriy
PY - 2015/12/16
Y1 - 2015/12/16
AB - The storage, manipulation, and especially internet transfer of large amounts of data produced by High-Throughput Sequencing (HTS) instruments present major obstacles to utilizing the full potential of this promising technology. The current standard is based on storing all data, which are produced in text (FASTQ and FASTA) and often stored in binary (SRA and BAM) formats. To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets in their existing formats. However, given the substantial improvements in the quality of HTS data, we believe that if one can afford to exclude low quality data and read headers, new much more compressed data formats can be used to reduce the size of HTS data files by at least two orders of magnitude. Here we present several examples of file formats specifically designed to store only high quality sequencing reads in space efficient text and binary form.

KW - File Formats
KW - HTS Data
KW - HTS File Converter
UR - http://www.scopus.com/inward/record.url?scp=84962360836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962360836&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2015.7359838
DO - 10.1109/BIBM.2015.7359838
M3 - Conference contribution
SN - 9781467367981
SP - 1117
EP - 1122
BT - Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
PB - Institute of Electrical and Electronics Engineers Inc.
ER -