Shock was created as a hybrid between a traditional file store and a database, offering the best of both worlds: all files can be stored immediately with almost zero overhead, but if desired, metadata can be added, allowing for a head start on the required business logic.
We initially designed Shock to work with our MG-RAST workflow, which at the time was implemented as a series of shell scripts and scheduled by Sun Grid Engine or Slurm or something similar (I honestly do not remember).
One of our most frequent problems was: I need to delete all temporary files for jobs X … Y.
Two problems arise: where are those jobs, and which files are temporary? Since our pipeline was (and is) constantly evolving, simply using a regular expression was (and is) not an option.
Shock implements a FileReaper() function that removes nodes once their expiration date has passed.
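With metadata in place, the cleanup question becomes a query instead of a filename guessing game. A minimal sketch, assuming Shock's attribute query interface (`GET /node?query&<attribute>=<value>`) and a hypothetical `temp` type and `job_id` attribute; `SHOCK_URL` is a placeholder:

```shell
#!/bin/sh
# Sketch: find and delete temporary nodes for a range of jobs by
# querying their metadata. SHOCK_URL, the "temp" type, and the
# "job_id" attribute are assumptions, not Shock defaults.
SHOCK_URL="${SHOCK_URL:-http://shock.example.org}"

# Build the query URL selecting the temporary nodes of a single job.
temp_nodes_query() {
    echo "${SHOCK_URL}/node?query&type=temp&job_id=$1"
}

# Delete a single node by its id.
delete_node() {
    curl -s -X DELETE "${SHOCK_URL}/node/$1"
}

# Walk a job range; with DRY_RUN=1 only print the queries.
reap_jobs() {
    for job in $(seq "$1" "$2"); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "GET $(temp_nodes_query "$job")"
        else
            # Real run: list matching nodes, then delete_node each id.
            curl -s "$(temp_nodes_query "$job")"
        fi
    done
}
```

The point of the sketch is the shape of the operation: selection happens server-side on indexed attributes, so the scripts never need to know where the files live.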
Shock was initially created around the year 2009, and one of the main driving forces was our frustration with installing and maintaining large shared file systems. We operated a large NFS store, a GPFS store, and a Lustre store. Each of them had good sides, but none of them was designed for the scale of our operation (many hundreds of gigabytes at the time). In addition, we are experts at stealing computational cycles wherever they can be found (sound familiar in cloud times?), and with traditional shared file systems the following dialogue got really old: "Ok, we need to mount X on Y and need to have a number of ports opened up on the firewall(s)." Response: "No, we cannot do that, why don't you …".
Shock is our answer to this dilemma. All you need is port 80. And, by the way, proxies work, as does caching. All we use is HTTP.
One example is the workflows in MG-RAST: for each metagenome we run a multi-step (roughly 20 stages) analytical pipeline. Each step leaves a data trace of small and large files. We can store all the files, add a TTL (time to live, effectively a file expiration date), and never worry about them. We use this to keep intermediate files around for a few days in case something went wrong.
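Attaching a TTL at upload time can be sketched as below. The `expiration` form field name is an assumption (check the Shock API documentation for your version), and `SHOCK_URL` is a placeholder:

```shell
#!/bin/sh
# Sketch: upload an intermediate file with a time-to-live so the
# reaper can remove it later. The "expiration" field name is an
# assumption about the Shock node API.
SHOCK_URL="${SHOCK_URL:-http://shock.example.org}"

# Compose the upload command; a ttl of "3d" would keep the file
# for three days before it becomes eligible for reaping.
upload_with_ttl_cmd() {
    file="$1"; ttl="$2"
    echo "curl -X POST -F upload=@${file} -F expiration=${ttl} ${SHOCK_URL}/node"
}
```

The function only composes the command, which makes it easy to log or dry-run; piping its output to `sh` would execute it.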
We created the following types:

```yaml
- ID: "EBI Submission Receipt"
  Description: "EBI Submission Receipt"
  Priority: 7
- ID: "inbox"
  Description: "MG-RAST inbox node"
  Priority: 1
- ID: "metagenome"
  Description: "MG-RAST metagenome"
  Priority: 9
  Data-Types:
    - fa
    - fasta
    - fastq
    - fq
    - bam
    - sam
- ID: "submission"
  Description: "MG-RAST submission"
  Priority: 9
- ID: "cv"
  Description: "Controlled Vocabulary"
  Priority: 7
- ID: "ontology"
  Description: "ontology"
  Priority: 7
- ID: "backup"
  Description: "Backup or Dump from another system e.g. MongoDB or MySQL"
  Priority: 9
- ID: "metadata"
  Description: "metadata"
  Priority: 7
- ID: "mixs"
  Description: "GSC MIxS Metadata file XLS format"
  Priority: 9
  Data-Types:
    - xls
    - xlsx
    - json
- ID: "reference"
  Description: "reference database"
  Priority: 7
- ID: "analysisObject"
  Description: "MG-RAST analysisObject"
  Priority: 1
- ID: "analysisRecipe"
  Description: "MG-RAST analysisRecipe"
  Priority: 1
- ID: "preference"
  Description: "MG-RAST user preference"
  Priority: 1
```
The goal was to create a low-maintenance, low-cost store for a DNA sequencing service facility. The facility produces multiple terabytes per week in the form of MANY small and several large files. Some files are acceptable to delete after a given interval. The source code for the service is available.
The requirements included:
We also created CWL workflows for processing.
The sequencing devices stream data onto a Unix server. After initial QC, operators can use a set of scripts (e.g. this script) to migrate the data into the structured Shock store. The scripts automatically add metadata that enables downstream processing and retrieval.
The basic function of the scripts is to invoke curl with a structured JSON document:
# project_id, owner and type are indexed in SHOCK
JSON="{ \
  \"type\" : \"run-folder-archive-fastq\", \
  \"project_id\" : \"${RUN_FOLDER_NAME}\", \
  \"owner\" : \"${OWNER}\", \
  \"group\" : \"${group}\", \
  \"project\" : \"${project}\", \
  \"sample\" : \"${sample}\", \
  \"name\" : \"${file}\" \
}"
It is important to note that ${OWNER} is a service account for the service. The other values are extracted from the directory structure created by the vendor software from user input.
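A less fragile way to assemble and submit such a document can be sketched as follows; it assumes a Shock version that accepts inline attributes via an `attributes_str` form field (older versions take a file via `-F "attributes=@meta.json"`), and `SHOCK_URL` is a placeholder:

```shell
#!/bin/sh
# Sketch: build the metadata with printf instead of hand-escaped
# inline JSON, then POST the data file together with its attributes.
# SHOCK_URL and the attributes_str field name are assumptions.
SHOCK_URL="${SHOCK_URL:-http://shock.example.org}"

# Emit a minimal attributes document; fields mirror the JSON above.
build_json() {
    printf '{"type":"%s","project_id":"%s","owner":"%s","name":"%s"}' \
           "$1" "$2" "$3" "$4"
}

JSON=$(build_json "run-folder-archive-fastq" "$RUN_FOLDER_NAME" "$OWNER" "$file")
# Real upload (commented out; needs a running Shock server):
# curl -s -X POST -F "upload=@${file}" -F "attributes_str=${JSON}" "${SHOCK_URL}/node"
```

Building the document in one place keeps the quoting rules out of the calling scripts, which was the main source of bugs with the inline form.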
We use a library of Unix Shell functions to securely write to the Shock server and verify MD5 checksums.
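The verification step can be sketched like this; the `"md5"` JSON field on the node document and the grep-based extraction are assumptions (a real library would more likely use jq or a Shock client):

```shell
#!/bin/sh
# Sketch: verify that Shock stored the same bytes we sent by comparing
# the local md5 with the checksum recorded on the node. The "md5"
# JSON field name and the grep extraction are assumptions.
SHOCK_URL="${SHOCK_URL:-http://shock.example.org}"

# Local checksum: md5sum prints "<hash>  <file>"; keep only the hash.
local_md5() {
    md5sum "$1" | awk '{print $1}'
}

# Fetch the node document and compare checksums; nonzero on mismatch.
verify_upload() {
    file="$1"; node_id="$2"
    remote=$(curl -s "${SHOCK_URL}/node/${node_id}" \
             | grep -o '"md5":"[0-9a-f]*"' | cut -d'"' -f4)
    [ "$(local_md5 "$file")" = "$remote" ]
}
```

Failing the comparison lets the scripts retry the upload before the local copy is deleted.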
The scripts add metadata, effectively creating data types. We implemented the following data types:
```yaml
- ID: "run-folder-archive-fastq"
  Description: "run-folder-archive-fastq"
  Priority: 9
  Data-Types:
    - fastq
- ID: "run-folder-archive-raw"
  Description: "run-folder-archive-raw"
  Priority: 4
- ID: "run-folder-archive-sav"
  Description: "run-folder-archive-sav"
  Priority: 9
  Data-Types:
    - sav
- ID: "run-folder-archive-thumbnails"
  Description: "run-folder-archive-thumbnails"
  Priority: 1
  Data-Types:
    -
```
The facility also provides a web frontend to their end users.