IGF database schema and API
Database schema
class igf_data.igfdb.igfTables.Analysis(**kwargs)
A table for loading analysis design information
- Parameters
analysis_id – An integer id for analysis table
project_id – A required integer id from project table (foreign key)
analysis_type – An optional enum list to specify analysis type, default is UNKNOWN, allowed values are RNA_DIFFERENTIAL_EXPRESSION, RNA_TIME_SERIES, CHIP_PEAK_CALL, SOMATIC_VARIANT_CALLING and UNKNOWN
analysis_description – An optional json description for analysis
class igf_data.igfdb.igfTables.Collection(**kwargs)
A table for loading collection information
- Parameters
collection_id – An integer id for collection table
name – A required string to specify collection name, allowed length 70
type – A required string to specify collection type, allowed length 50
table – An optional enum list to specify collection table information, default unknown, allowed values are sample, experiment, run, file, project, seqrun and unknown
date_stamp – An optional timestamp column to record entry creation or modification time, default current timestamp
class igf_data.igfdb.igfTables.Collection_attribute(**kwargs)
A table for loading collection attributes
- Parameters
collection_attribute_id – An integer id for collection_attribute table
attribute_name – An optional string attribute name, allowed length 200
attribute_value – An optional string attribute value, allowed length 200
collection_id – An integer id from collection table (foreign key)
class igf_data.igfdb.igfTables.Collection_group(**kwargs)
A table for linking files to the collection entries
- Parameters
collection_group_id – An integer id for collection_group table
collection_id – A required integer id from collection table (foreign key)
file_id – A required integer id from file table (foreign key)
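The Collection_group rows resolve a many-to-many link between Collection and File. A minimal sketch of that join using plain Python dicts (illustrative only, not the igf_data API; all table data and the `files_for_collection` helper are made up here):

```python
# Stand-ins for the Collection and File tables, keyed by their integer ids.
collections = {1: {"name": "run1_fastq", "type": "demultiplexed_fastq", "table": "run"}}
files = {10: {"file_path": "/data/run1_R1.fastq.gz"},
         11: {"file_path": "/data/run1_R2.fastq.gz"}}

# Each Collection_group row carries the two foreign keys.
collection_group = [
    {"collection_group_id": 1, "collection_id": 1, "file_id": 10},
    {"collection_group_id": 2, "collection_id": 1, "file_id": 11}]

def files_for_collection(collection_id):
    """Return file paths linked to a collection via Collection_group."""
    return [files[row["file_id"]]["file_path"]
            for row in collection_group
            if row["collection_id"] == collection_id]
```

In the real schema the same resolution is done with SQL joins on the foreign keys; the dicts above only illustrate the shape of the link table.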
class igf_data.igfdb.igfTables.Experiment(**kwargs)
A table for loading experiment (unique combination of sample, library and platform) information.
- Parameters
experiment_id – An integer id for experiment table
experiment_igf_id – A required string as experiment id specific to IGF team, allowed length 40
project_id – A required integer id from project table (foreign key)
sample_id – A required integer id from sample table (foreign key)
library_name – A required string to specify library name, allowed length 50
library_source – An optional enum list to specify library source information, default is UNKNOWN, allowed values are GENOMIC, TRANSCRIPTOMIC, GENOMIC_SINGLE_CELL, TRANSCRIPTOMIC_SINGLE_CELL, METAGENOMIC, METATRANSCRIPTOMIC, SYNTHETIC, VIRAL_RNA and UNKNOWN
library_strategy – An optional enum list to specify library strategy information, default is UNKNOWN, allowed values are WGS, WXS, WGA, RNA-SEQ, CHIP-SEQ, ATAC-SEQ, MIRNA-SEQ, NCRNA-SEQ, FL-CDNA, EST, HI-C, DNASE-SEQ, WCS, RAD-SEQ, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, MNASE-SEQ, DNASE-HYPERSENSITIVITY, BISULFITE-SEQ, CTS, MRE-SEQ, MEDIP-SEQ, MBD-SEQ, TN-SEQ, VALIDATION, FAIRE-SEQ, SELEX, RIP-SEQ, CHIA-PET, SYNTHETIC-LONG-READ, TARGETED-CAPTURE, TETHERED, NOME-SEQ, CHIRP-SEQ, 4-C-SEQ, 5-C-SEQ and UNKNOWN
experiment_type – An optional enum list as experiment type information, default is UNKNOWN, allowed values are POLYA-RNA, POLYA-RNA-3P, TOTAL-RNA, SMALL-RNA, WGS, WGA, WXS, WXS-UTR, RIBOSOME-PROFILING, RIBODEPLETION, 16S, NCRNA-SEQ, FL-CDNA, EST, HI-C, DNASE-SEQ, WCS, RAD-SEQ, CLONE, POOLCLONE, AMPLICON, CLONEEND, FINISHING, DNASE-HYPERSENSITIVITY, RRBS-SEQ, WGBS, CTS, MRE-SEQ, MEDIP-SEQ, MBD-SEQ, TN-SEQ, VALIDATION, FAIRE-SEQ, SELEX, RIP-SEQ, CHIA-PET, SYNTHETIC-LONG-READ, TARGETED-CAPTURE, TETHERED, NOME-SEQ, CHIRP-SEQ, 4-C-SEQ, 5-C-SEQ, METAGENOMIC, METATRANSCRIPTOMIC, TF, H3K27ME3, H3K27AC, H3K9ME3, H3K36ME3, H3F3A, H3K4ME1, H3K79ME2, H3K79ME3, H3K9ME1, H3K9ME2, H4K20ME1, H2AFZ, H3AC, H3K4ME2, H3K4ME3, H3K9AC, HISTONE-NARROW, HISTONE-BROAD, CHIP-INPUT, ATAC-SEQ, TENX-TRANSCRIPTOME-3P, TENX-TRANSCRIPTOME-5P, DROP-SEQ-TRANSCRIPTOME and UNKNOWN
library_layout – An optional enum list to specify library layout, default is UNKNOWN, allowed values are SINGLE, PAIRED and UNKNOWN
status – An optional enum list to specify experiment status, default is ACTIVE, allowed values are ACTIVE, FAILED and WITHDRAWN
date_created – An optional timestamp column to record entry creation or modification time, default current timestamp
platform_name – An optional enum list to specify platform model, default is UNKNOWN, allowed values are HISEQ2500, HISEQ4000, MISEQ, NEXTSEQ, NANOPORE_MINION, DNBSEQ-G400, DNBSEQ-G50, DNBSEQ-T7 and UNKNOWN
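Since several Experiment columns are enum-constrained, a record can be sanity-checked against the allowed values before loading. A minimal stdlib-only sketch (the `validate_experiment` helper and the subset of enums it checks are illustrative, not part of igf_data):

```python
# Allowed values copied from the library_layout and status enums documented above.
ALLOWED_LIBRARY_LAYOUT = {"SINGLE", "PAIRED", "UNKNOWN"}
ALLOWED_STATUS = {"ACTIVE", "FAILED", "WITHDRAWN"}

def validate_experiment(record):
    """Return the names of enum columns holding a disallowed value.

    Missing columns fall back to the documented defaults, so an empty
    list means the record is acceptable for these two columns."""
    errors = []
    if record.get("library_layout", "UNKNOWN") not in ALLOWED_LIBRARY_LAYOUT:
        errors.append("library_layout")
    if record.get("status", "ACTIVE") not in ALLOWED_STATUS:
        errors.append("status")
    return errors
```

The same pattern extends to the longer enums (library_source, library_strategy, experiment_type) by adding one allowed-value set per column.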
class igf_data.igfdb.igfTables.Experiment_attribute(**kwargs)
A table for loading experiment attributes
- Parameters
experiment_attribute_id – An integer id for experiment_attribute table
attribute_name – An optional string attribute name, allowed length 30
attribute_value – An optional string attribute value, allowed length 50
experiment_id – An integer id from experiment table (foreign key)
class igf_data.igfdb.igfTables.File(**kwargs)
A table for loading file information
- Parameters
file_id – An integer id for file table
file_path – A required string to specify file path information, allowed length 500
location – An optional enum list to specify storage location, default UNKNOWN, allowed values are ORWELL, HPC_PROJECT, ELIOT, IRODS and UNKNOWN
status – An optional enum list to specify file status, default is ACTIVE, allowed values are ACTIVE, FAILED and WITHDRAWN
md5 – An optional string to specify file md5 value, allowed length 33
size – An optional string to specify file size, allowed length 15
date_created – An optional timestamp column to record file creation time, default current timestamp
date_updated – An optional timestamp column to record file modification time, default current timestamp
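The md5 and size columns hold a hex digest and a byte count for each registered file. A self-contained sketch of how those two values can be produced with the standard library (the `file_md5_and_size` helper is illustrative, not an igf_data function):

```python
import hashlib
import os
import tempfile

def file_md5_and_size(path, chunk_size=1 << 20):
    """Compute the md5 hex digest and byte size for a file, reading in
    chunks so large sequencing files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # size is returned as a string, matching the column type documented above
    return digest.hexdigest(), str(os.path.getsize(path))

# Example: write a small file and fingerprint it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT\n")
md5, size = file_md5_and_size(tmp.name)
```

An md5 hex digest is always 32 characters, which fits the column's allowed length.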
class igf_data.igfdb.igfTables.File_attribute(**kwargs)
A table for loading file attributes
- Parameters
file_attribute_id – An integer id for file_attribute table
attribute_name – An optional string attribute name, allowed length 30
attribute_value – An optional string attribute value, allowed length 50
file_id – An integer id from file table (foreign key)
class igf_data.igfdb.igfTables.Flowcell_barcode_rule(**kwargs)
A table for loading flowcell specific barcode rules information
- Parameters
flowcell_rule_id – An integer id for flowcell_barcode_rule table
platform_id – An integer id for platform table (foreign key)
flowcell_type – A required string as flowcell type name, allowed length 50
index_1 – An optional enum list as index_1 specific rule, default UNKNOWN, allowed values are NO_CHANGE, REVCOMP and UNKNOWN
index_2 – An optional enum list as index_2 specific rule, default UNKNOWN, allowed values are NO_CHANGE, REVCOMP and UNKNOWN
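The REVCOMP rule means the index barcode must be reverse-complemented for that flowcell type before demultiplexing, while NO_CHANGE leaves it as sequenced. A minimal sketch of applying such a rule (the `apply_barcode_rule` helper is illustrative, not part of igf_data):

```python
# Complement table for the four DNA bases used in index barcodes.
_COMP = str.maketrans("ACGT", "TGCA")

def apply_barcode_rule(index_seq, rule):
    """Apply a Flowcell_barcode_rule value to an index sequence."""
    if rule == "REVCOMP":
        # complement each base, then reverse the sequence
        return index_seq.translate(_COMP)[::-1]
    # NO_CHANGE and UNKNOWN leave the barcode untouched
    return index_seq
```

For example, an i5 index `ATTACTCG` becomes `CGAGTAAT` under REVCOMP.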
class igf_data.igfdb.igfTables.History(**kwargs)
A table for loading history information
- Parameters
log_id – An integer id for history table
log_type – A required enum value to specify log type, allowed values are CREATED, MODIFIED and DELETED
table_name – A required enum value to specify table information, allowed values are PROJECT, USER, SAMPLE, EXPERIMENT, RUN, COLLECTION, FILE, PLATFORM, PROJECT_ATTRIBUTE, EXPERIMENT_ATTRIBUTE, COLLECTION_ATTRIBUTE, SAMPLE_ATTRIBUTE, RUN_ATTRIBUTE and FILE_ATTRIBUTE
log_date – An optional timestamp column to record entry creation or modification time, default current timestamp
message – An optional text field to specify message
class igf_data.igfdb.igfTables.Pipeline(**kwargs)
A table for loading pipeline information
- Parameters
pipeline_id – An integer id for pipeline table
pipeline_name – A required string to specify pipeline name, allowed length 50
pipeline_db – A required string to specify pipeline database url, allowed length 200
pipeline_init_conf – An optional json field to specify initial pipeline configuration
pipeline_run_conf – An optional json field to specify modified pipeline configuration
pipeline_type – An optional enum list to specify pipeline type, default EHIVE, allowed values are EHIVE and UNKNOWN
is_active – An optional enum list to specify the status of pipeline, default Y, allowed values are Y and N
date_stamp – An optional timestamp column to record entry creation or modification time, default current timestamp
class igf_data.igfdb.igfTables.Pipeline_seed(**kwargs)
A table for loading pipeline seed information
- Parameters
pipeline_seed_id – An integer id for pipeline_seed table
seed_id – A required integer id
seed_table – An optional enum list to specify seed table information, default unknown, allowed values project, sample, experiment, run, file, seqrun, collection and unknown
pipeline_id – An integer id from pipeline table (foreign key)
status – An optional enum list to specify the status of the pipeline seed, default UNKNOWN, allowed values are SEEDED, RUNNING, FINISHED, FAILED and UNKNOWN
date_stamp – An optional timestamp column to record entry creation or modification time, default current timestamp
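The seed status values above suggest a simple lifecycle for pipeline runs. The transition table and `next_status` helper below are an illustrative sketch of one such lifecycle, assumed rather than documented (in particular, re-seeding a FAILED seed is an assumption):

```python
# Hypothetical lifecycle: which new status values are legal from each current one.
VALID_TRANSITIONS = {
    "UNKNOWN": {"SEEDED"},
    "SEEDED": {"RUNNING"},
    "RUNNING": {"FINISHED", "FAILED"},
    "FAILED": {"SEEDED"},   # assumption: failed seeds may be re-seeded
    "FINISHED": set()}      # terminal state

def next_status(current, new):
    """Validate and return a status transition for a pipeline seed."""
    if new not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Guarding transitions this way keeps a seed table from recording impossible histories such as FINISHED moving back to RUNNING.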
class igf_data.igfdb.igfTables.Platform(**kwargs)
A table for loading sequencing platform information
- Parameters
platform_id – An integer id for platform table
platform_igf_id – A required string as platform id specific to IGF team, allowed length 10
model_name – A required enum list to specify platform model, allowed values are HISEQ2500, HISEQ4000, MISEQ, NEXTSEQ, NOVASEQ6000, NANOPORE_MINION, DNBSEQ-G400, DNBSEQ-G50 and DNBSEQ-T7
vendor_name – A required enum list to specify vendor's name, allowed values are ILLUMINA, NANOPORE and MGI
software_name – A required enum list for specifying platform software, allowed values are RTA and UNKNOWN
software_version – An optional software version number, default is UNKNOWN
date_created – An optional timestamp column to record entry creation time, default current timestamp
class igf_data.igfdb.igfTables.Project(**kwargs)
A table for loading project information
- Parameters
project_id – An integer id for project table
project_igf_id – A required string as project id specific to IGF team, allowed length 50
project_name – An optional string as project name
start_timestamp – An optional timestamp for project creation, default current timestamp
description – An optional text column to document project description
deliverable – An enum list to document project deliverable, default FASTQ, allowed entries are FASTQ, ALIGNMENT and ANALYSIS
status – An enum list for project status, default ACTIVE, allowed entries are ACTIVE, FINISHED and WITHDRAWN
class igf_data.igfdb.igfTables.ProjectUser(**kwargs)
A table for linking users to the projects
- Parameters
project_user_id – An integer id for project_user table
project_id – An integer id for project table (foreign key)
user_id – An integer id for user table (foreign key)
data_authority – An optional enum value to denote primary user for the project, allowed value T
class igf_data.igfdb.igfTables.Project_attribute(**kwargs)
A table for loading project attributes
- Parameters
project_attribute_id – An integer id for project_attribute table
attribute_name – An optional string attribute name, allowed length 50
attribute_value – An optional string attribute value, allowed length 50
project_id – An integer id from project table (foreign key)
class igf_data.igfdb.igfTables.Run(**kwargs)
A table for loading run (unique combination of experiment, sequencing flowcell and lane) information
- Parameters
run_id – An integer id for run table
run_igf_id – A required string as run id specific to IGF team, allowed length 70
experiment_id – A required integer id from experiment table (foreign key)
seqrun_id – A required integer id from seqrun table (foreign key)
status – An optional enum list to specify run status, default is ACTIVE, allowed values are ACTIVE, FAILED and WITHDRAWN
lane_number – A required enum list for specifying lane information, allowed values 1, 2, 3, 4, 5, 6, 7 and 8
date_created – An optional timestamp column to record entry creation time, default current timestamp
class igf_data.igfdb.igfTables.Run_attribute(**kwargs)
A table for loading run attributes
- Parameters
run_attribute_id – An integer id for run_attribute table
attribute_name – An optional string attribute name, allowed length 30
attribute_value – An optional string attribute value, allowed length 50
run_id – An integer id from run table (foreign key)
class igf_data.igfdb.igfTables.Sample(**kwargs)
A table for loading sample information
- Parameters
sample_id – An integer id for sample table
sample_igf_id – A required string as sample id specific to IGF team, allowed length 20
sample_submitter_id – An optional string as sample name from user, allowed length 40
taxon_id – An optional integer NCBI taxonomy information for sample
scientific_name – An optional string as scientific name of the species
species_name – An optional string as the species name (genome build code) information
donor_anonymized_id – An optional string as anonymous donor name
description – An optional string as sample description
phenotype – An optional string as sample phenotype information
sex – An optional enum list to specify sample sex, default UNKNOWN, allowed values are FEMALE, MALE, MIXED and UNKNOWN
status – An optional enum list to specify sample status, default ACTIVE, allowed values are ACTIVE, FAILED and WITHDRAWN
biomaterial_type – An optional enum list as sample biomaterial type, default UNKNOWN, allowed values are PRIMARY_TISSUE, PRIMARY_CELL, PRIMARY_CELL_CULTURE, CELL_LINE, SINGLE_NUCLEI and UNKNOWN
cell_type – An optional string to specify sample cell_type information, if biomaterial_type is PRIMARY_CELL or PRIMARY_CELL_CULTURE
tissue_type – An optional string to specify sample tissue information, if biomaterial_type is PRIMARY_TISSUE
cell_line – An optional string to specify cell line information, if biomaterial_type is CELL_LINE
date_created – An optional timestamp column to specify entry creation date, default current timestamp
project_id – An integer id for project table (foreign key)
class igf_data.igfdb.igfTables.Sample_attribute(**kwargs)
A table for loading sample attributes
- Parameters
sample_attribute_id – An integer id for sample_attribute table
attribute_name – An optional string attribute name, allowed length 50
attribute_value – An optional string attribute value, allowed length 50
sample_id – An integer id from sample table (foreign key)
class igf_data.igfdb.igfTables.Seqrun(**kwargs)
A table for loading sequencing run information
- Parameters
seqrun_id – An integer id for seqrun table
seqrun_igf_id – A required string as seqrun id specific to IGF team, allowed length 50
reject_run – An optional enum list to specify rejected run information, default N, allowed values Y and N
date_created – An optional timestamp column to record entry creation time, default current timestamp
flowcell_id – A required string column for storing flowcell_id information, allowed length 20
platform_id – An integer platform id (foreign key)
class igf_data.igfdb.igfTables.Seqrun_attribute(**kwargs)
A table for loading seqrun attributes
- Parameters
seqrun_attribute_id – An integer id for seqrun_attribute table
attribute_name – An optional string attribute name, allowed length 50
attribute_value – An optional string attribute value, allowed length 100
seqrun_id – An integer id from seqrun table (foreign key)
class igf_data.igfdb.igfTables.Seqrun_stats(**kwargs)
A table for loading sequencing stats information
- Parameters
seqrun_stats_id – An integer id for seqrun_stats table
seqrun_id – An integer seqrun id (foreign key)
lane_number – A required enum list for specifying lane information, allowed values are 1, 2, 3, 4, 5, 6, 7 and 8
bases_mask – An optional string field for storing bases mask information
undetermined_barcodes – An optional json field to store barcode info for undetermined samples
known_barcodes – An optional json field to store barcode info for known samples
undetermined_fastqc – An optional json field to store qc info for undetermined samples
class igf_data.igfdb.igfTables.User(**kwargs)
A table for loading user information
- Parameters
user_id – An integer id for user table
user_igf_id – An optional string as user id specific to IGF team, allowed length 10
name – A required string as user name, allowed length 30
email_id – A required string as email id, allowed length 40
username – A required string as IGF username, allowed length 20
hpc_username – An optional string as Imperial College’s HPC login name, allowed length 20
twitter_user – An optional string as twitter user name, allowed length 20
category – An optional enum list as user category, default NON_HPC_USER, allowed values are HPC_USER, NON_HPC_USER and EXTERNAL
status – An optional enum list as user status, default is ACTIVE, allowed values are ACTIVE, BLOCKED and WITHDRAWN
date_created – An optional timestamp, default current timestamp
password – An optional string field to store encrypted password
encryption_salt – An optional string field to store encryption salt
ht_password – An optional field to store password for htaccess
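The User table stores an encrypted password alongside a separate encryption_salt. The exact scheme is not documented here; the sketch below shows one common salted-hash approach using only the standard library (PBKDF2), purely as an illustration of why the salt is stored in its own column:

```python
import hashlib
import secrets

def hash_password(password, salt=None):
    """Hash a password with a per-user salt (illustrative scheme, not
    necessarily the one used by igf_data)."""
    salt = salt or secrets.token_hex(16)          # new random salt if none given
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt.encode(), 100_000)
    return digest.hex(), salt                      # store both columns

def verify_password(password, stored_hash, salt):
    """Re-hash with the stored salt and compare."""
    return hash_password(password, salt)[0] == stored_hash
```

Keeping the salt in its own column lets each user's password be hashed independently, so identical passwords do not produce identical stored values.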
Database adaptor API
Base adaptor
class igf_data.igfdb.baseadaptor.BaseAdaptor(**data)
The base adaptor class
divide_data_to_table_and_attribute(data, required_column, table_columns, attribute_name_column='attribute_name', attribute_value_column='attribute_value')
A method for separating data for main and attribute tables
- Parameters
data – a dictionary or dataframe containing the data
required_column – column to add to the attribute table, it must be part of the data
table_columns – required columns for the main table
attribute_name_column – column label for attribute name
attribute_value_column – column label for attribute value
- Returns
Two pandas dataframes, one for main table and one for attribute tables
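The split performed by divide_data_to_table_and_attribute can be pictured on a single record. The real method operates on pandas dataframes; the pure-Python `divide_record` helper below is only an illustrative sketch of the same idea:

```python
def divide_record(record, required_column, table_columns,
                  attribute_name_column="attribute_name",
                  attribute_value_column="attribute_value"):
    """Split one record dict into a main-table row and attribute rows.

    Keys listed in table_columns go to the main table; every other key
    becomes an attribute name/value row that keeps required_column so
    it can be linked back to the main record."""
    table_row = {k: v for k, v in record.items() if k in table_columns}
    attribute_rows = [
        {required_column: record[required_column],
         attribute_name_column: k,
         attribute_value_column: v}
        for k, v in record.items()
        if k not in table_columns]
    return table_row, attribute_rows
```

For example, a project record with an extra `priority` key would yield a Project row plus one Project_attribute row named `priority`.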
fetch_records(query, output_mode='dataframe')
A method for fetching records using a query
- Parameters
query – A SQLAlchemy query object
output_mode – dataframe / object / one / one_or_none
- Returns
A pandas dataframe for dataframe mode and a generator object for object mode
fetch_records_by_column(table, column_name, column_id, output_mode)
A method for fetching records matching a column value
- Parameters
table – table name
column_name – a column name
column_id – a column id value
output_mode – dataframe / object / one / one_or_none
fetch_records_by_multiple_column(table, column_data, output_mode)
A method for fetching records matching multiple column values
- Parameters
table – table name
column_data – a dictionary of column_name: column_value pairs
output_mode – dataframe / object / one / one_or_none
get_attributes_by_dbid(attribute_table, linked_table, linked_column_name, db_id)
A method for fetching attribute records for a specific attribute table with a db_id linked as foreign key
- Parameters
attribute_table – An attribute table object
linked_table – A main table object
linked_column_name – A column name linking the main table
db_id – A unique id to link main table
- Returns
A dataframe of records
get_table_columns(table_name, excluded_columns)
A method for fetching the columns for table table_name
- Parameters
table_name – a table class name
excluded_columns – a list of column names to exclude from output
map_foreign_table_and_store_attribute(data, lookup_table, lookup_column_name, target_column_name)
A method for mapping foreign key id to the new column
- Parameters
data – a data dictionary or pandas series, to be stored in attribute table
lookup_table – a table class to look for the foreign key id
lookup_column_name – a string or a list of column names which will be used to link the data frame with lookup_table, this column will be removed from the output series
target_column_name – column name for the foreign key id
- Returns
A data series
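The mapping performed here replaces a lookup column (for instance project_igf_id) with the corresponding foreign key id resolved from the lookup table. A pure-Python sketch of the same idea (the `map_foreign_key` helper and the dict-based lookup rows are illustrative, not the igf_data implementation):

```python
def map_foreign_key(record, lookup_rows, lookup_column_name, target_column_name):
    """Swap a lookup value for its foreign key id.

    Removes lookup_column_name from the record and adds
    target_column_name resolved from lookup_rows."""
    value = record.pop(lookup_column_name)   # lookup column is dropped from output
    for row in lookup_rows:
        if row[lookup_column_name] == value:
            record[target_column_name] = row[target_column_name]
            return record
    raise ValueError(f"no match for {value!r} in lookup table")
```

In the real adaptor the lookup is a database query against the lookup table class rather than a scan over dicts.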
store_attributes(attribute_table, data, linked_column='', db_id='', mode='serial')
A method for storing attributes
- Parameters
attribute_table – an attribute table name
linked_column – a column name to link the db_id to attribute table
db_id – a db_id to link the attribute records
mode – serial / bulk
store_records(table, data, mode='serial')
A method for loading data to table
- Parameters
table – name of the table class
data – pandas dataframe or a list of dictionaries
mode – serial / bulk
Project adaptor
class igf_data.igfdb.projectadaptor.ProjectAdaptor(**data)
An adaptor class for Project, ProjectUser and Project_attribute tables
assign_user_to_project(data, required_project_column='project_igf_id', required_user_column='email_id', data_authority_column='data_authority', autosave=True)
Load data to ProjectUser table
- Parameters
data – A list of dictionaries, each containing 'project_igf_id' and 'email_id' as keys with relevant igf ids as the values. An optional key 'data_authority' with a boolean value can be provided to set the user as the data authority of the project. E.g. [{'project_igf_id': val, 'email_id': val, 'data_authority': True},]
required_project_column – Name of the project id column, default project_igf_id
required_user_column – Name of the user id column, default email_id
data_authority_column – Name of the data_authority column, default data_authority
autosave – A toggle for autocommit to db, default True
- Returns
None
A method for checking user data authority for existing projects
- Parameters
project_igf_id – An unique project igf id
- Returns
True if data authority exists for project or false
check_existing_project_user(project_igf_id, email_id)
A method for checking existing project user info in database
- Parameters
project_igf_id – A project_igf_id
email_id – An email_id
- Returns
True if the entry is present in db or False if it's not
check_project_attributes(project_igf_id, attribute_name)
A method for checking existing project attribute in database
- Parameters
project_igf_id – A unique project igf id
attribute_name – An attribute name
- Returns
A boolean value
check_project_records_igf_id(project_igf_id, target_column_name='project_igf_id')
A method for checking existing data for Project table
- Parameters
project_igf_id – Project igf id name
target_column_name – Name of the project id column, default project_igf_id
- Returns
True if the entry is present in db or False if it's not
count_project_samples(project_igf_id, only_active=True)
A method for counting total number of samples for a project
- Parameters
project_igf_id – A project id
only_active – Toggle for including only active projects, default is True
- Returns
An int sample count
divide_data_to_table_and_attribute(data, required_column='project_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')
A method for separating data for Project and Project_attribute tables
- Parameters
data – A list of dictionaries or a pandas dataframe
table_columns – List of table column names, default None
required_column – Name of the required column, default project_igf_id
attribute_name_column – Value for attribute name column, default attribute_name
attribute_value_column – Value for attribute value column, default attribute_value
- Returns
A project dataframe and a project attribute dataframe
fetch_all_project_igf_ids(output_mode='dataframe')
A method for fetching a list of all project igf ids
- Parameters
output_mode – Output mode, default dataframe
A method for fetching user data authority for existing projects
- Parameters
project_igf_id – An unique project igf id
- Returns
A user object or None, if no entry found
fetch_project_records_igf_id(project_igf_id, target_column_name='project_igf_id')
A method for fetching data for Project table
- Parameters
project_igf_id – an igf id
output_mode – dataframe / object / one
- Returns
Records from project table
fetch_project_samples(project_igf_id, only_active=True, output_mode='object')
A method for fetching all the samples for a specific project
- Parameters
project_igf_id – A project id
only_active – Toggle for including only active projects, default is True
output_mode – Output mode, default object
- Returns
Depends on the output_mode, a generator expression, dataframe or an object
get_project_attributes(project_igf_id, linked_column_name='project_id', attribute_name='')
A method for fetching entries from project attribute table
- Parameters
project_igf_id – A project_igf_id string
attribute_name – An attribute name, default is None
linked_column_name – A column name for linking attribute table
- Returns
A dataframe of records
get_project_user_info(output_mode='dataframe', project_igf_id='')
A method for fetching information from Project, User and ProjectUser table
- Parameters
project_igf_id – a project igf id
output_mode – dataframe / object
- Returns
Records for project user
store_project_and_attribute_data(data, autosave=True)
A method for dividing and storing data to project and attribute tables
- Parameters
data – A list of data or a pandas dataframe
autosave – A toggle for autocommit, default True
- Returns
None
store_project_attributes(data, project_id='', autosave=False)
A method for storing data to Project_attribute table
- Parameters
data – A pandas dataframe
project_id – Project id for attribute table, default ‘’
autosave – A toggle for autocommit, default False
- Returns
None
store_project_data(data, autosave=False)
Load data to Project table
- Parameters
data – A list of data or a pandas dataframe
autosave – A toggle for autocommit, default False
- Returns
None
User adaptor
class igf_data.igfdb.useradaptor.UserAdaptor(**data)
An adaptor class for table User
check_user_records_email_id(email_id)
A method for checking existing user data in db
- Parameters
email_id – An email id
- Returns
True if the user is present in db or False if it's not
fetch_user_records_email_id(user_email_id)
A method for fetching data for User table
- Parameters
user_email_id – an email id
- Returns
user object
fetch_user_records_igf_id(user_igf_id)
A method for fetching data for User table
- Parameters
user_igf_id – an igf id
- Returns
user object
store_user_data(data, autosave=True)
Load data to user table
- Parameters
data – A pandas dataframe
autosave – A toggle for autocommit, default True
- Returns
None
Sample adaptor
class igf_data.igfdb.sampleadaptor.SampleAdaptor(**data)
An adaptor class for Sample and Sample_attribute tables
check_project_and_sample(project_igf_id, sample_igf_id)
A method for checking existing project and sample igf id combination in sample table
- Parameters
project_igf_id – A project igf id string
sample_igf_id – A sample igf id string
- Returns
True if target entry is present, or False
check_sample_records_igf_id(sample_igf_id, target_column_name='sample_igf_id')
A method for checking existing data for sample table
- Parameters
sample_igf_id – an igf id
target_column_name – name of the target lookup column, default sample_igf_id
- Returns
True if the entry is present in db or False if it's not
divide_data_to_table_and_attribute(data, required_column='sample_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')
A method for separating data for Sample and Sample_attribute tables
- Parameters
data – A list of dictionaries or a pandas dataframe
table_columns – List of table column names, default None
required_column – column name to add to the attribute data
attribute_name_column – label for attribute name column
attribute_value_column – label for attribute value column
- Returns
Two pandas dataframes, one for Sample and another for Sample_attribute table
fetch_sample_project(sample_igf_id)
A method for fetching project information for the sample
- Parameters
sample_igf_id – A sample_igf_id for database lookup
- Returns
A project_igf_id or None, if not found
fetch_sample_records_igf_id(sample_igf_id, target_column_name='sample_igf_id')
A method for fetching data for Sample table
- Parameters
sample_igf_id – A sample igf id
output_mode – dataframe, object, one or one_or_none
- Returns
An object or dataframe, based on the output_mode
store_sample_and_attribute_data(data, autosave=True)
A method for dividing and storing data to sample and attribute table
store_sample_attributes(data, sample_id='', autosave=False)
A method for storing data to Sample_attribute table
- Parameters
data – A dataframe or list of dictionary containing the Sample_attribute data
sample_id – An optional parameter to link the sample attributes to a specific sample
store_sample_data(data, autosave=False)
Load data to Sample table
- Parameters
data – A dataframe or list of dictionary containing the data
Experiment adaptor
class igf_data.igfdb.experimentadaptor.ExperimentAdaptor(**data)
An adaptor class for Experiment and Experiment_attribute tables
check_experiment_records_id(experiment_igf_id, target_column_name='experiment_igf_id')
A method for checking existing data for Experiment table
- Parameters
experiment_igf_id – an igf id
target_column_name – a column name, default experiment_igf_id
- Returns
True if the entry is present in db or False if it's not
divide_data_to_table_and_attribute(data, required_column='experiment_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')
A method for separating data for Experiment and Experiment_attribute tables
- Parameters
data – A list of dictionaries or a Pandas DataFrame
table_columns – List of table column names, default None
required_column – column name to add to the attribute data
attribute_name_column – label for attribute name column
attribute_value_column – label for attribute value column
- Returns
Two pandas dataframes, one for Experiment and another for Experiment_attribute table
fetch_experiment_records_id(experiment_igf_id, target_column_name='experiment_igf_id')
A method for fetching data for Experiment table
- Parameters
experiment_igf_id – an igf id
target_column_name – a column name, default experiment_igf_id
- Returns
Experiment object
fetch_project_and_sample_for_experiment(experiment_igf_id)
A method for fetching project and sample igf id information for an experiment
- Parameters
experiment_igf_id – An experiment igf id string
- Returns
Two strings, project igf id and sample igf id, or None if not found
fetch_runs_for_igf_id(experiment_igf_id, include_active_runs=True, output_mode='dataframe')
A method for fetching all the runs for a specific experiment_igf_id
- Parameters
experiment_igf_id – An experiment_igf_id
include_active_runs – Include only active runs if it is True, default True
output_mode – Record fetch mode, default dataframe
fetch_sample_attribute_records_for_experiment_igf_id(experiment_igf_id, output_mode='dataframe', attribute_list=None)
A method for fetching sample_attribute records for a given experiment_igf_id
- Parameters
experiment_igf_id – An experiment_igf_id
output_mode – Result output mode, default dataframe
attribute_list – A list of attributes for database lookup, default None
- Returns
An object or dataframe based on the output_mode
store_experiment_attributes(data, experiment_id='', autosave=False)
A method for storing data to Experiment_attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame for experiment attribute data
experiment_id – An optional experiment_id to link attribute records
autosave – A toggle for automatically saving data to db, default False
store_experiment_data(data, autosave=False)
Load data to Experiment table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving data to db, default False
store_project_and_attribute_data(data, autosave=True)
A method for dividing and storing data to experiment and attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving data to db, default True
update_experiment_records_by_igf_id(experiment_igf_id, update_data, autosave=True)
A method for updating experiment records in database
- Parameters
experiment_igf_id – An igf id for the experiment data lookup
update_data – A dictionary containing the updated entries
autosave – Toggle auto commit after database update, default True
-
Run adaptor¶
-
class
igf_data.igfdb.runadaptor.
RunAdaptor
(**data)¶ An adaptor class for Run and Run_attribute tables
-
check_run_records_igf_id
(run_igf_id, target_column_name='run_igf_id')¶ A method for checking existing data in the Run table
- Parameters
run_igf_id – an igf id
target_column_name – a column name, default run_igf_id
- Returns
True if the record is present in db, False if it is not
-
divide_data_to_table_and_attribute
(data, required_column='run_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')¶ A method for separating data for Run and Run_attribute tables
- Parameters
data – A list of dictionaries or a Pandas DataFrame
table_columns – List of table column names, default None
required_column – column name to add to the attribute data
attribute_name_column – label for attribute name column
attribute_value_column – label for attribute value column
- Returns
Two pandas dataframes, one for Run and another for Run_attribute table
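The divide_data_to_table_and_attribute pattern used across the adaptors can be sketched with plain pandas: known table columns stay in the main dataframe, everything else is melted into name/value attribute rows keyed by the required column. This is an illustration of the split, not the adaptor's actual implementation, and the sample column names are assumptions.

```python
import pandas as pd

def split_table_and_attributes(data, table_columns, required_column='run_igf_id',
                               attribute_name_column='attribute_name',
                               attribute_value_column='attribute_value'):
    """Split records into a table dataframe and a melted attribute dataframe."""
    df = pd.DataFrame(data)
    table_df = df[[c for c in table_columns if c in df.columns]]
    # every column not listed as a table column becomes an attribute row,
    # keyed by the required column
    attr_cols = [c for c in df.columns if c not in table_columns]
    attr_df = df[[required_column] + attr_cols].melt(
        id_vars=[required_column],
        var_name=attribute_name_column,
        value_name=attribute_value_column)
    return table_df, attr_df

# hypothetical record: lane_number is a table column, R1_READ_COUNT is not
records = [{'run_igf_id': 'RUN1', 'lane_number': '1', 'R1_READ_COUNT': 1000}]
table_df, attr_df = split_table_and_attributes(
    records, table_columns=['run_igf_id', 'lane_number'])
```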
-
fetch_flowcell_and_lane_for_run
(run_igf_id)¶ A run adaptor method for fetching flowcell id and lane info for each run
- Parameters
run_igf_id – A run igf id string
- Returns
Flowcell id and lane number, or None if no records were found
-
fetch_project_sample_and_experiment_for_run
(run_igf_id)¶ A method for fetching project, sample and experiment information for a run
- Parameters
run_igf_id – A run igf id string
- Returns
A list of three strings, or None if not found
project_igf_id
sample_igf_id
experiment_igf_id
-
fetch_run_records_igf_id
(run_igf_id, target_column_name='run_igf_id')¶ A method for fetching data for Run table
- Parameters
run_igf_id – an igf id
target_column_name – a column name, default run_igf_id
-
fetch_sample_info_for_run
(run_igf_id)¶ A method for fetching sample information linked to a run_igf_id
- Parameters
run_igf_id – A run_igf_id to search database
-
store_run_and_attribute_data
(data, autosave=True)¶ A method for dividing and storing data to run and attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame containing the run data
autosave – A toggle for saving data automatically to db, default True
-
store_run_attributes
(data, run_id='', autosave=False)¶ A method for storing data to Run_attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame containing the attribute data
run_id – An optional run_id to link attribute records, default empty string
autosave – A toggle for saving data automatically to db, default False
-
store_run_data
(data, autosave=False)¶ A method for loading data to Run table
- Parameters
data – A list of dictionaries or a Pandas DataFrame containing the run data
autosave – A toggle for saving data automatically to db, default False
-
Collection adaptor¶
-
class
igf_data.igfdb.collectionadaptor.
CollectionAdaptor
(**data)¶ An adaptor class for Collection, Collection_group and Collection_attribute tables
-
check_collection_attribute
(collection_name, collection_type, attribute_name)¶ A method for checking collection attribute records for an attribute_name
- Parameters
collection_name – A collection name
collection_type – A collection type
attribute_name – A collection attribute name
- Returns
Boolean, True if the record exists, False otherwise
-
check_collection_records_name_and_type
(collection_name, collection_type)¶ A method for checking existing data for Collection table
- Parameters
collection_name – a collection name value
collection_type – a collection type value
- Returns
True if the record is present in db, False if it is not
-
create_collection_group
(data, autosave=True, required_collection_column=('name', 'type'), required_file_column='file_path')¶ A function for creating collection group, a link between a file and a collection
- Parameters
data –
A list of dictionaries or a Pandas DataFrame with the following columns
name
type
file_path
E.g. [{'name': 'a collection name', 'type': 'a collection type', 'file_path': 'path'}]
required_collection_column – List of required column for fetching collection, default ‘name’,’type’
required_file_column – Required column for fetching file information, default file_path
autosave – A toggle for saving changes to database, default True
-
create_or_update_collection_attributes
(data, autosave=True)¶ A method for creating or updating collection attribute table, if the collection exists
- Parameters
data –
A list of dictionaries containing the following entries
name
type
attribute_name
attribute_value
autosave – A toggle for saving changes to database, default True
-
divide_data_to_table_and_attribute
(data, required_column=('name', 'type'), table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')¶ A method for separating data for Collection and Collection_attribute tables
- Parameters
data – A list of dictionaries or a pandas dataframe
table_columns – List of table column names, default None
required_column – column name to add to the attribute data, default ‘name’, ‘type’
attribute_name_column – label for attribute name column, default attribute_name
attribute_value_column – label for attribute value column, default attribute_value
- Returns
Two pandas dataframes, one for Collection and another for Collection_attribute table
-
fetch_collection_name_and_table_from_file_path
(file_path)¶ A method for fetching collection name and collection_table info using the file_path information. It will return None if the file doesn’t have any collection present in the database
- Parameters
file_path – A filepath info
- Returns
Collection name and collection table for first collection group
-
fetch_collection_records_name_and_type
(collection_name, collection_type, target_column_name=('name', 'type'))¶ A method for fetching data for Collection table
- Parameters
collection_name – a collection name value
collection_type – a collection type value
target_column_name – a list of columns, default is [‘name’,’type’]
-
get_collection_files
(collection_name, collection_type='', collection_table='', output_mode='dataframe')¶ A method for fetching information from Collection, File, Collection_group tables
- Parameters
collection_name – A collection name to fetch the linked files
collection_type – A collection type
collection_table – A collection table
output_mode – dataframe / object
-
load_file_and_create_collection
(data, autosave=True, hasher='md5', calculate_file_size_and_md5=True, required_coumns=('name', 'type', 'table', 'file_path', 'size', 'md5', 'location'))¶ A function for loading files to db and creating collections
- Parameters
data – A list of dictionary or a Pandas dataframe
autosave – Save data to db, default True
required_coumns – List of required columns
hasher – Method for file checksum, default md5
calculate_file_size_and_md5 – Enable file size and md5 check, default True
-
static
prepare_data_for_collection_attribute
(collection_name, collection_type, data_list)¶ A static method for building data structure for collection attribute table update
- Parameters
collection_name – A collection name
collection_type – A collection type
data_list – A list of dictionaries containing the data for the attribute table
- Returns
A new list of dictionaries for the collection attribute table
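The reshaping described above can be sketched as follows: each attribute/value pair in the input list becomes one row carrying the collection name and type. The output field names are assumptions based on the Collection_attribute table documented earlier; this is an illustration, not the library's code.

```python
def prepare_collection_attribute_rows(collection_name, collection_type, data_list):
    """Build Collection_attribute-style rows from a list of attribute dicts."""
    rows = []
    for entry in data_list:
        for attribute_name, attribute_value in entry.items():
            rows.append({
                'name': collection_name,
                'type': collection_type,
                'attribute_name': attribute_name,
                'attribute_value': attribute_value})
    return rows

# hypothetical collection and attribute values
rows = prepare_collection_attribute_rows(
    'run1_analysis', 'ANALYSIS_CRAM', [{'GENOME_BUILD': 'GRCh38'}])
```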
-
remove_collection_group_info
(data, autosave=True, required_collection_column=('name', 'type'), required_file_column='file_path')¶ A method for removing collection group information from database
- Parameters
data –
A list of dictionaries or a Pandas DataFrame with the following columns
name
type
file_path
File_path information is not mandatory
required_collection_column – List of required column for fetching collection, default ‘name’,’type’
required_file_column – Required column for fetching file information, default file_path
autosave – A toggle for saving changes to database, default True
-
store_collection_and_attribute_data
(data, autosave=True)¶ A method for dividing and storing data to collection and attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for saving changes to database, default True
-
store_collection_attributes
(data, collection_id='', autosave=False)¶ A method for storing data to Collection_attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
collection_id – A collection id, optional
autosave – A toggle for saving changes to database, default False
-
store_collection_data
(data, autosave=False)¶ A method for loading data to Collection table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for saving changes to database, default False
-
update_collection_attribute
(collection_name, collection_type, attribute_name, attribute_value, autosave=True)¶ A method for updating collection attribute
- Parameters
collection_name – A collection name
collection_type – A collection type
attribute_name – A collection attribute name
attribute_value – A collection attribute value
autosave – A toggle for committing changes to db, default True
-
File adaptor¶
-
class
igf_data.igfdb.fileadaptor.
FileAdaptor
(**data)¶ An adaptor class for File tables
-
check_file_records_file_path
(file_path)¶ A method for checking file information in database
- Parameters
file_path – An absolute filepath
- Returns
True if the file is present in db, False if it is not
-
divide_data_to_table_and_attribute
(data, required_column='file_path', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')¶ A method for separating data for File and File_attribute tables
- Parameters
data – A list of dictionaries or a Pandas DataFrame
table_columns – List of table column names, default None
required_column – A column name to add to the attribute data
attribute_name_column – A label for attribute name column
attribute_value_column – A label for attribute value column
- Returns
Two pandas dataframes, one for File and another for File_attribute table
-
fetch_file_records_file_path
(file_path)¶ A method for fetching data for file table
- Parameters
file_path – an absolute file path
- Returns
A file object
-
remove_file_data_for_file_path
(file_path, remove_file=False, autosave=True)¶ A method for removing entry for a specific file.
- Parameters
file_path – A complete file_path for checking database
remove_file – A toggle for removing filepath, default False
autosave – A toggle for automatically saving changes to database, default True
-
store_file_and_attribute_data
(data, autosave=True)¶ A method for dividing and storing data to file and attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving changes to db, default True
-
store_file_attributes
(data, file_id='', autosave=False)¶ A method for storing data to File_attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
file_id – A file_id for updating the attribute table, default empty string
autosave – A toggle for automatically saving changes to db, default False
-
store_file_data
(data, autosave=False)¶ Load data to file table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving changes to db, default False
-
update_file_table_for_file_path
(file_path, tag, value, autosave=False)¶ A method for updating file table
- Parameters
file_path – A file_path for database look up
tag – A keyword for file column name
value – A new value for the file column
autosave – A toggle for autosave, default False
-
Sequencing run adaptor¶
-
class
igf_data.igfdb.seqrunadaptor.
SeqrunAdaptor
(**data)¶ An adaptor class for table Seqrun
-
divide_data_to_table_and_attribute
(data, required_column='seqrun_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')¶ A method for separating data for Seqrun and Seqrun_attribute tables
- Parameters
data – A list of dictionaries or a pandas dataframe
table_columns – List of table column names, default None
required_column – column name to add to the attribute data
attribute_name_column – label for attribute name column
attribute_value_column – label for attribute value column
- Returns
Two pandas dataframes, one for Seqrun and another for Seqrun_attribute table
-
fetch_flowcell_barcode_rules_for_seqrun
(seqrun_igf_id, flowcell_label='flowcell')¶ A method for fetching flowcell barcode rules for a seqrun
- Parameters
seqrun_igf_id – A seqrun igf id
flowcell_label – Label of the flowcell attribute, default flowcell
-
fetch_seqrun_records_igf_id
(seqrun_igf_id, target_column_name='seqrun_igf_id')¶ A method for fetching data for Seqrun table
- Parameters
seqrun_igf_id – an igf id
target_column_name – a column name in the Seqrun table, default seqrun_igf_id
-
store_seqrun_and_attribute_data
(data, autosave=True)¶ A method for dividing and storing data to seqrun and attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving data to db, default True
-
store_seqrun_attributes
(data, seqrun_id='', autosave=False)¶ A method for storing data to Seqrun_attribute table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
seqrun_id – An optional seqrun_id to link attribute records, default empty string
autosave – A toggle for automatically saving data to db, default False
-
store_seqrun_data
(data, autosave=False)¶ Load data to Seqrun table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving data to db, default False
-
store_seqrun_stats_data
(data, seqrun_id='', autosave=True)¶ A method for storing data to seqrun_stats table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
seqrun_id – An optional seqrun_id to link the stats records, default empty string
autosave – A toggle for automatically saving data to db, default True
-
Platform adaptor¶
-
class
igf_data.igfdb.platformadaptor.
PlatformAdaptor
(**data)¶ An adaptor class for Platform tables
-
fetch_platform_records_igf_id
(platform_igf_id, target_column_name='platform_igf_id', output_mode='one')¶ A method for fetching data for Platform table
- Parameters
platform_igf_id – an igf id
target_column_name – column name in the Platform table, default is platform_igf_id
output_mode – Result fetch mode, default one
-
store_flowcell_barcode_rule
(data, autosave=True)¶ Load data to flowcell_barcode_rule table
- Parameters
data –
A dictionary or dataframe containing the following columns
platform_igf_id / platform_id
flowcell_type
index_1 (NO_CHANGE/REVCOMP/UNKNOWN)
index_2 (NO_CHANGE/REVCOMP/UNKNOWN)
autosave – A toggle for automatically saving data to db, default True
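The semantics of the index rules can be sketched as follows: REVCOMP reverse-complements an index barcode before demultiplexing, while NO_CHANGE leaves it untouched. The helper name is an assumption for illustration; it is not part of the library's API.

```python
def apply_index_rule(index_seq, rule):
    """Apply a flowcell barcode rule (NO_CHANGE/REVCOMP/UNKNOWN) to an index."""
    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
    if rule == 'REVCOMP':
        # reverse the sequence and complement each base
        return ''.join(complement[b] for b in reversed(index_seq))
    return index_seq  # NO_CHANGE / UNKNOWN: pass the index through

out = apply_index_rule('ATTACTCG', 'REVCOMP')
```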
-
store_platform_data
(data, autosave=True)¶ Load data to Platform table
- Parameters
data – A list of dictionaries or a Pandas DataFrame
autosave – A toggle for automatically saving data to db, default True
-
Pipeline adaptor¶
-
class
igf_data.igfdb.pipelineadaptor.
PipelineAdaptor
(**data)¶ An adaptor class for Pipeline and Pipeline_seed tables
-
create_pipeline_seed
(data, autosave=True, status_column='status', seeded_label='SEEDED', required_columns=('pipeline_id', 'seed_id', 'seed_table'))¶ A method for creating a new entry in the pipeline_seed table
- Parameters
data –
A dataframe or a hash, it should contain the following fields
pipeline_name / pipeline_id
seed_id
seed_table
-
fetch_pipeline_records_pipeline_name
(pipeline_name, target_column_name='pipeline_name')¶ A method for fetching data for Pipeline table
- Parameters
pipeline_name – a pipeline name
target_column_name – a column name, default pipeline_name
-
fetch_pipeline_seed
(pipeline_id, seed_id, seed_table, target_column_name=('pipeline_id', 'seed_id', 'seed_table'))¶ A method for fetching unique pipeline seed using pipeline_id, seed_id and seed_table
- Parameters
pipeline_id – A pipeline db id
seed_id – A seed entry db id
seed_table – A seed table name
target_column_name – Target set of columns
-
fetch_pipeline_seed_with_table_data
(pipeline_name, table_name='seqrun', status='SEEDED')¶ A method for fetching linked table records for the seeded entries in pipeseed table
- Parameters
pipeline_name – A pipeline name
table_name – A table name for pipeline_seed lookup, default seqrun
status – A text label for seeded status, default is SEEDED
- Returns
Two pandas dataframe for pipeline_seed entries and data from other tables
-
seed_new_experiments
(pipeline_name, species_name_list, fastq_type, project_list=None, library_source_list=None, active_status='ACTIVE', autosave=True, seed_table='experiment')¶ A method for seeding new experiments for primary analysis
- Parameters
pipeline_name – Name of the analysis pipeline
project_list – List of projects to consider for seeding analysis pipeline, default None
library_source_list – List of library source to consider for analysis, default None
species_name_list – List of sample species to consider for seeding analysis pipeline
active_status – Label for active status, default ACTIVE
autosave – A toggle for autosaving records in database, default True
seed_table – Seed table for pipeseed table, default experiment
- Returns
A list of available projects for seeding the analysis table (if project_list is None) or None, and a list of seeded experiments or None
-
seed_new_seqruns
(pipeline_name, autosave=True, seed_table='seqrun')¶ A method for creating seed for new seqruns
- Parameters
pipeline_name – A pipeline name
autosave – A toggle for autosaving records in database, default True
seed_table – Seed table for pipeseed lookup, default seqrun
-
store_pipeline_data
(data, autosave=True)¶ Load data to Pipeline table
-
update_pipeline_seed
(data, autosave=True, required_columns=('pipeline_id', 'seed_id', 'seed_table', 'status'))¶ A method for updating the seed status in pipeline_seed table
- Parameters
data –
A dataframe or a hash, it should contain the following fields
pipeline_name / pipeline_id
seed_id
seed_table
status
-
Utility functions for database access¶
Database utility functions¶
-
igf_data.utils.dbutils.
clean_and_rebuild_database
(dbconfig)¶ A method for deleting existing data from the database and creating empty tables
- Parameters
dbconfig – A json file containing the database connection info
-
igf_data.utils.dbutils.
read_dbconf_json
(dbconfig)¶ A method for reading dbconfig json file
- Parameters
dbconfig – A json file containing the database connection info, e.g. {"dbhost": "DBHOST", "dbport": PORT, "dbuser": "USER", "dbpass": "DBPASS", "dbname": "DBNAME", "driver": "mysql", "connector": "pymysql"}
- Returns
A dictionary containing the db connection parameters
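The dbconfig workflow can be sketched with the stdlib alone: parse the json file, then assemble a connection URL from the documented keys. The URL assembly below is an illustration of how the parameters fit together, not the library's actual code.

```python
import json
import os
import tempfile

def read_dbconf_json(dbconfig):
    """Read database connection parameters from a json file."""
    with open(dbconfig, 'r') as fp:
        return json.load(fp)

def build_db_url(params):
    """Assemble an SQLAlchemy-style URL, e.g. mysql+pymysql://USER:PASS@HOST:PORT/DB."""
    return '{driver}+{connector}://{dbuser}:{dbpass}@{dbhost}:{dbport}/{dbname}'.format(**params)

# write a sample dbconfig file following the documented key names
conf = {"dbhost": "localhost", "dbport": 3306, "dbuser": "igf",
        "dbpass": "secret", "dbname": "igfdb",
        "driver": "mysql", "connector": "pymysql"}
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as fp:
    json.dump(conf, fp)
    path = fp.name
params = read_dbconf_json(path)
url = build_db_url(params)
os.unlink(path)
```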
-
igf_data.utils.dbutils.
read_json_data
(data_file)¶ A method for reading data from json file
- Parameters
data_file – A Json format file
- Returns
A list of dictionaries
Project adaptor utility functions¶
-
igf_data.utils.projectutils.
draft_email_for_project_cleanup
(template_file, data, draft_output)¶ A method for drafting email for cleanup
- Parameters
template_file – A template file
data –
A list of dictionaries or a dictionary containing the following columns
name
email_id
projects
cleanup_date
draft_output – An output filename
-
igf_data.utils.projectutils.
find_projects_for_cleanup
(dbconfig_file, warning_note_weeks=24, all_warning_note=False)¶ A function for finding old projects for cleanup
- Parameters
dbconfig_file – A dbconfig file path
warning_note_weeks – Number of weeks from last sequencing run to wait before sending warnings, default 24
all_warning_note – A toggle for sending warning notes to all, default False
- Returns
Three lists: one with warning notes, one with final notes and one with cleanup entries
-
igf_data.utils.projectutils.
get_files_and_irods_path_for_project
(project_igf_id, db_session_class, irods_path_prefix='/igfZone/home/')¶ A function for listing all the files and irods dir path for a given project
- Parameters
project_igf_id – A string containing the project igf id
db_session_class – A database session object
irods_path_prefix – A string containing irods path prefix, default ‘/igfZone/home/’
- Returns
A list containing all the files for a project and a string containing the irods path for the project
-
igf_data.utils.projectutils.
get_project_read_count
(project_igf_id, session_class, run_attribute_name='R1_READ_COUNT', active_status='ACTIVE')¶ A utility method for fetching sample read counts for an input project_igf_id
- Parameters
project_igf_id – A project_igf_id string
session_class – A db session class object
run_attribute_name – Attribute name from Run_attribute table for read count lookup
active_status – text label for active runs, default ACTIVE
- Returns
A pandas dataframe containing following columns
project_igf_id
sample_igf_id
flowcell_id
attribute_value
-
igf_data.utils.projectutils.
get_seqrun_info_for_project
(project_igf_id, session_class)¶ A utility method for fetching seqrun_igf_id and flowcell_id which are linked to a specific project_igf_id
- Parameters
project_igf_id – A project_igf_id string
session_class – A db session class object
- Returns
A pandas dataframe containing following columns
seqrun_igf_id
flowcell_id
-
igf_data.utils.projectutils.
mark_project_and_list_files_for_cleanup
(project_igf_id, dbconfig_file, outout_dir, force_overwrite=True, use_ephemeral_space=False, irods_path_prefix='/igfZone/home/', withdrawn_tag='WITHDRAWN')¶ A wrapper function for project cleanup operation
- Parameters
project_igf_id – A project igf id string
dbconfig_file – A dbconf json file path
outout_dir – Output dir path for dumping file lists for project
force_overwrite – Overwrite existing output file, default True
use_ephemeral_space – A toggle for temp dir, default False
irods_path_prefix – Prefix for irods path, default /igfZone/home/
withdrawn_tag – A string tag for marking files in db, default WITHDRAWN
- Returns
None
-
igf_data.utils.projectutils.
mark_project_as_withdrawn
(project_igf_id, db_session_class, withdrawn_tag='WITHDRAWN')¶ A function for marking all the entries for a specific project as withdrawn
- Parameters
project_igf_id – A string containing the project igf id
db_session_class – A dbsession object
withdrawn_tag – A string for withdrawn field in db, default WITHDRAWN
- Returns
None
-
igf_data.utils.projectutils.
mark_project_barcode_check_off
(project_igf_id, session_class, barcode_check_attribute='barcode_check', barcode_check_val='OFF')¶ A utility method for marking project barcode check as off using the project_igf_id
- Parameters
project_igf_id – A project_igf_id string
session_class – A db session class object
barcode_check_attribute – A text keyword for barcode check attribute, default barcode_check
barcode_check_val – A text for barcode check attribute value, default is ‘OFF’
- Returns
None
-
igf_data.utils.projectutils.
notify_project_for_cleanup
(warning_template, final_notice_template, cleanup_template, warning_note_list, final_note_list, cleanup_list, use_ephemeral_space=False)¶ A function for sending emails to users for project cleanup
- Parameters
warning_template – An email template file for warning
final_notice_template – An email template for the final notice
cleanup_template – An email template for sending the cleanup list to igf
warning_note_list –
A list of dictionaries containing the following fields to warn users about cleanup
name
email_id
projects
cleanup_date
final_note_list – A list of dictionaries containing the above mentioned fields to notify users about the final cleanup
cleanup_list – A list of dictionaries containing the above mentioned fields to list projects for cleanup
use_ephemeral_space – A toggle for using the ephemeral space, default False
-
igf_data.utils.projectutils.
send_email_to_user_via_sendmail
(draft_email_file, waiting_time=20, sendmail_exe='sendmail', dry_run=False)¶ A function for sending email to users via sendmail
- Parameters
draft_email_file – A draft email to be sent to user
waiting_time – Wait time after sending each email, default 20 seconds
sendmail_exe – Sendmail exe path, default sendmail
dry_run – A toggle for dry run, default False
Sequencing adaptor utility functions¶
-
igf_data.utils.seqrunutils.
get_seqrun_date_from_igf_id
(seqrun_igf_id)¶ A utility method for fetching sequence run date from the igf id
- Parameters
seqrun_igf_id – A seqrun igf id string
- Returns
A string value of the date
-
igf_data.utils.seqrunutils.
load_new_seqrun_data
(data_file, dbconfig)¶ A method for loading new data for seqrun table
- Parameters
data_file – A file containing the seqrun data
dbconfig – A json file containing the database connection info
Pipeline adaptor utility functions¶
-
igf_data.utils.pipelineutils.
find_new_analysis_seeds
(dbconfig_path, pipeline_name, project_name_file, species_name_list, fastq_type, library_source_list)¶ A utils method for finding and seeding new experiments for analysis
- Parameters
dbconfig_path – A database configuration file
slack_config – A slack configuration file
pipeline_name – Pipeline name
fastq_type – Fastq collection type
project_name_file – A file containing the list of projects for seeding pipeline
species_name_list – A list of species to consider for seeding analysis
library_source_list – A list of library source info to consider for seeding analysis
- Returns
List of available experiments or None, and a list of seeded experiments or None
-
igf_data.utils.pipelineutils.
load_new_pipeline_data
(data_file, dbconfig)¶ A method for loading new data for pipeline table
- Parameters
data_file – A file containing the pipeline data
dbconfig – A json file containing the database connection info
Platform adaptor utility functions¶
-
igf_data.utils.platformutils.
load_new_flowcell_data
(data_file, dbconfig)¶ A method for loading new data to flowcell table
- Parameters
data_file – A file containing the flowcell data
dbconfig – A json file containing the database connection info
-
igf_data.utils.platformutils.
load_new_platform_data
(data_file, dbconfig)¶ A method for loading new data for platform table
- Parameters
data_file – A file containing the platform data
dbconfig – A json file containing the database connection info
Pipeline seed adaptor utility functions¶
-
igf_data.utils.ehive_utils.pipeseedfactory_utils.
get_pipeline_seeds
(pipeseed_mode, pipeline_name, igf_session_class, seed_id_label='seed_id', seqrun_date_label='seqrun_date', seqrun_id_label='seqrun_id', experiment_id_label='experiment_id', seqrun_igf_id_label='seqrun_igf_id')¶ A utils function for fetching pipeline seed information
- Parameters
pipeseed_mode – A string info about pipeseed mode, allowed values are demultiplexing and alignment
pipeline_name – A string info about pipeline name
igf_session_class – A database session class for pipeline seed lookup
- Returns
Two Pandas dataframes, first with pipeseed entries and second with seed info
IGF pipeline api¶
Pipeline api¶
Fetch fastq files for analysis¶
-
igf_data.utils.analysis_fastq_fetch_utils.
get_fastq_input_list
(db_session_class, experiment_igf_id, combine_fastq_dir=False, fastq_collection_type='demultiplexed_fastq', active_status='ACTIVE')¶ A function for fetching all the fastq files linked to a specific experiment id
- Parameters
db_session_class – A database session class
experiment_igf_id – An experiment igf id
fastq_collection_type – Fastq collection type name, default demultiplexed_fastq
active_status – text label for active runs, default ACTIVE
combine_fastq_dir – Combine fastq file directories for output line, default False
- Returns
A list of fastq file or fastq dir paths for the analysis run
- Raises
ValueError – It raises ValueError if no fastq directory found
Load analysis result to database and file system¶
-
class
igf_data.utils.analysis_collection_utils.
Analysis_collection_utils
(dbsession_class, base_path=None, collection_name=None, collection_type=None, collection_table=None, rename_file=True, add_datestamp=True, tag_name=None, analysis_name=None, allowed_collection=('sample', 'experiment', 'run', 'project'))¶ A class for dealing with analysis file collections. It has specific methods for moving analysis files to a specific directory structure and renaming them using a uniform rule, if required, e.g. '<collection_name>_<analysis_name>_<tag>_<datestamp>.<original_suffix>'
- Parameters
dbsession_class – A database session class
collection_name – Collection name information for file, default None
collection_type – Collection type information for file, default None
collection_table – Collection table information for file, default None
base_path – A base filepath to move file while loading, default ‘None’ for no file move
rename_file – Rename file based on collection_table type while loading, default True
add_datestamp – Add datestamp while loading the file, default True
analysis_name – Analysis name for the file, required for renaming while loading, default None
tag_name – Additional tag for filename, default None
allowed_collection –
List of allowed collection tables
sample, experiment, run, project
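The renaming rule quoted above, '<collection_name>_<analysis_name>_<tag>_<datestamp>.<original_suffix>', can be sketched as a small helper. The function name and arguments are assumptions for illustration; the class derives these values from its constructor arguments and the input file path.

```python
import os
from datetime import date

def build_analysis_file_name(input_file, collection_name, analysis_name,
                             tag_name, datestamp=None):
    """Build '<collection_name>_<analysis_name>_<tag>_<datestamp>.<original_suffix>'."""
    # keep the full original suffix, e.g. 'vcf.gz' rather than just 'gz'
    suffix = os.path.basename(input_file).split('.', 1)[1]
    datestamp = datestamp or date.today().strftime('%Y%m%d')
    return '{0}_{1}_{2}_{3}.{4}'.format(
        collection_name, analysis_name, tag_name, datestamp, suffix)

# hypothetical inputs; the datestamp is pinned for a reproducible result
name = build_analysis_file_name(
    '/tmp/input.vcf.gz', 'SAMPLE1', 'variant_calling', 'filtered',
    datestamp='20200101')
```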
-
create_or_update_analysis_collection
(file_path, dbsession, withdraw_exisitng_collection=True, autosave_db=True, force=True, remove_file=False)¶ A method for creating or updating an analysis file collection in db. Required elements will be collected from database if the base_path element is given.
- Parameters
file_path – file path to load as db collection
dbsession – An active database session
withdraw_exisitng_collection – Remove existing collection group
autosave_db – Save changes to database, default True
remove_file – A toggle for removing existing file from disk, default False
force – Toggle for removing existing file collection, default True
-
get_new_file_name
(input_file, file_suffix=None)¶ A method for fetching new file name
- Parameters
input_file – An input filepath
file_suffix – A file suffix
-
load_file_to_disk_and_db
(input_file_list, withdraw_exisitng_collection=True, autosave_db=True, file_suffix=None, force=True, remove_file=False)¶ A method for loading analysis results to disk and database. File will be moved to a new path if base_path is present. Directory structure of the final path is based on the collection_table information.
Following will be the final directory structure if base_path is present
project - base_path/project_igf_id/analysis_name
sample - base_path/project_igf_id/sample_igf_id/analysis_name
experiment - base_path/project_igf_id/sample_igf_id/experiment_igf_id/analysis_name
run - base_path/project_igf_id/sample_igf_id/experiment_igf_id/run_igf_id/analysis_name
- Parameters
input_file_list – A list of input file to load, all using the same collection info
withdraw_exisitng_collection – Remove existing collection group, DO NOT use this while loading a list of files
autosave_db – Save changes to database, default True
file_suffix – Use a specific file suffix, use None if it should be the same as the original file, e.g. input.vcf.gz to output.vcf.gz
force – Toggle for removing existing file, default True
remove_file – A toggle for removing existing file from disk, default False
- Returns
A list of final filepath
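The directory layout listed above can be sketched as a path builder keyed by collection_table. The id values below are placeholders for what the method resolves via database lookups; the helper itself is an illustration, not the library's code.

```python
import os

def analysis_dir_for_collection(base_path, collection_table, analysis_name,
                                project_igf_id, sample_igf_id=None,
                                experiment_igf_id=None, run_igf_id=None):
    """Build the final analysis dir following the documented structure."""
    parts = {
        'project': [project_igf_id],
        'sample': [project_igf_id, sample_igf_id],
        'experiment': [project_igf_id, sample_igf_id, experiment_igf_id],
        'run': [project_igf_id, sample_igf_id, experiment_igf_id, run_igf_id]}
    return os.path.join(base_path, *parts[collection_table], analysis_name)

# hypothetical ids for an experiment-level collection
path = analysis_dir_for_collection(
    '/results', 'experiment', 'primary_analysis',
    project_igf_id='PRJ1', sample_igf_id='SAM1', experiment_igf_id='EXP1')
```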
Run metadata validation checks¶
-
class
igf_data.utils.validation_check.metadata_validation.
Validate_project_and_samplesheet_metadata
(samplesheet_file, metadata_files, samplesheet_schema, metadata_schema, samplesheet_name='SampleSheet.csv')¶ A class for running validation checks on project and samplesheet metadata files
- Parameters
samplesheet_file – A samplesheet input file
metadata_files – A list of metadata input files
samplesheet_schema – A json schema for samplesheet file validation
metadata_schema – A json schema for metadata file validation
-
static
check_metadata_library_by_row
(data)¶ A static method for checking library type metadata per row
- Parameters
data – A pandas data series containing sample metadata
- Returns
An error message or None
-
compare_metadata
()¶ A function for comparing samplesheet and metadata files
- Returns
A list of error or an empty list
-
convert_errors_to_gviz
(output_json=None)¶ A method for converting the list of errors to gviz format json
- Parameters
output_json – A output json file for saving data, default None
- Returns
A gviz json data block for the html output if output_json is None, or else None
-
dump_error_to_csv
(output_csv)¶ A method for dumping the list of errors to a csv file
- Returns
Output csv file path if any errors were found, or else None
-
get_merged_errors
()¶ A method for running the validation checks on input metadata and samplesheet files
- Returns
A list of errors or an empty list
-
get_metadata_validation_report
()¶ A method for running validation checks on input metadata files
- Returns
A list of errors or an empty list
-
get_samplesheet_validation_report
()¶ A method for running validation checks on input samplesheet file
- Returns
A list of errors or an empty list
-
static
validate_metadata_library_type
(sample_id, library_source, library_strategy, experiment_type)¶ A static method for validating library metadata information for a sample
- Parameters
sample_id – Sample name
library_source – Library source information
library_strategy – Library strategy information
experiment_type – Experiment type information
- Returns
An error message string or None
Generic utility functions¶
Basic fasta sequence processing¶
-
igf_data.utils.sequtils.
rev_comp
(input_seq)¶ A function for converting nucleotide sequence to its reverse complement
- Parameters
input_seq – A string of nucleotide sequence
- Returns
Reverse complement version of the input sequence
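A straightforward sketch of rev_comp using str.translate; the library's implementation may differ, but the result should match for plain A/T/G/C (and N) sequences.

```python
def rev_comp(input_seq):
    """Return the reverse complement of a nucleotide sequence."""
    # complement each base, preserving case and N, then reverse
    table = str.maketrans('ATGCatgcNn', 'TACGtacgNn')
    return input_seq.translate(table)[::-1]

out = rev_comp('ATGCCN')
```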
Advanced fastq file processing¶
-
igf_data.utils.fastq_utils.
compare_fastq_files_read_counts
(r1_file, r2_file)¶ A method for comparing read counts for fastq pairs
- Parameters
r1_file – Fastq pair R1 file path
r2_file – Fastq pair R2 file path
- Raises
ValueError if the counts are not the same
-
igf_data.utils.fastq_utils.
count_fastq_lines
(fastq_file)¶ A method for counting fastq lines
- Parameters
fastq_file – A gzipped or unzipped fastq file
- Returns
Fastq line count
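A gzip-aware line count along these lines could look like the following sketch; the helper name is hypothetical and the suffix-based gzip detection is an assumption:

```python
import gzip

def count_fastq_lines_sketch(fastq_file):
    # Open gzipped files transparently based on the .gz suffix
    opener = gzip.open if fastq_file.endswith('.gz') else open
    with opener(fastq_file, 'rt') as fh:
        return sum(1 for _ in fh)
```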
-
igf_data.utils.fastq_utils.
detect_non_fastq_in_file_list
(input_list)¶ A method for detecting non-fastq files within a list of input fastq files
- Parameters
input_list – A list of filepath to check
- Returns
True if non-fastq files are present, or else False
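A suffix-based check like the following sketch is one way to flag non-fastq entries; the helper name and the recognised suffix list are assumptions:

```python
def detect_non_fastq_sketch(input_list):
    # Flag any file that lacks a recognised fastq suffix
    fastq_suffixes = ('.fastq', '.fq', '.fastq.gz', '.fq.gz')
    return any(not f.endswith(fastq_suffixes) for f in input_list)
```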
-
igf_data.utils.fastq_utils.
identify_fastq_pair
(input_list, sort_output=True, check_count=False)¶ A method for fastq read pair identification
- Parameters
input_list – A list of input fastq files
sort_output – Sort output list, default True
check_count – Check read count for fastq pair, only available if sort_output is True, default False
- Returns
A list of read1 files and another list of read2 files
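Pair identification of this kind is typically based on the Illumina `_R1_`/`_R2_` filename tokens. A minimal sketch, assuming that naming convention (the function name is hypothetical):

```python
import re

def identify_fastq_pair_sketch(input_list):
    # Classify files as R1 or R2 from the standard Illumina
    # _R1_/_R2_ filename tokens; pattern is an assumption
    r1_files = sorted(f for f in input_list if re.search(r'_R1_', f))
    r2_files = sorted(f for f in input_list if re.search(r'_R2_', f))
    return r1_files, r2_files
```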
Process local and remote files¶
-
igf_data.utils.fileutils.
calculate_file_checksum
(filepath, hasher='md5')¶ A method for file checksum calculation
- Parameters
filepath – a file path
hasher – default is md5, allowed: md5 or sha256
- Returns
file checksum value
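Checksum calculation of this kind can be sketched with hashlib, streaming the file in chunks; the helper name is hypothetical:

```python
import hashlib

def calculate_checksum_sketch(filepath, hasher='md5'):
    # Stream the file in chunks so large files are not read into memory
    h = hashlib.md5() if hasher == 'md5' else hashlib.sha256()
    with open(filepath, 'rb') as fh:
        for chunk in iter(lambda: fh.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()
```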
-
igf_data.utils.fileutils.
check_file_path
(file_path)¶ A function for checking an existing filepath
- Parameters
file_path – An input filepath to check
- Raises
IOError – It raises IOError if file not found
-
igf_data.utils.fileutils.
copy_local_file
(source_path, destinationa_path, cd_to_dest=True, force=False)¶ A method for copying files on local disk
- Parameters
source_path – A source file path
destinationa_path – A destination file path, including the file name
cd_to_dest – Change to destination dir before copy, default True
force – Optional, set True to overwrite existing destination file, default is False
-
igf_data.utils.fileutils.
copy_remote_file
(source_path, destinationa_path, source_address=None, destination_address=None, copy_method='rsync', check_file=True, force_update=False, exclude_pattern_list=None)¶ A method for copying files from or to a remote location
- Parameters
source_path – A source file path
destinationa_path – A destination file path
source_address – Address of the source server
destination_address – Address of the destination server
copy_method – A method for copying files, default is ‘rsync’
check_file – Check file after transfer using checksum, default True
force_update – Overwrite existing file or dir, default is False
exclude_pattern_list – List of file patterns to exclude, default None
-
igf_data.utils.fileutils.
create_file_manifest_for_dir
(results_dirpath, output_file, md5_label='md5', size_lavel='size', path_label='file_path', exclude_list=None, force=True)¶ A method for creating an md5 and size list for all the files in a directory path
- Parameters
results_dirpath – A file path for the input file directory
output_file – Name of the output csv filepath
exclude_list – A list of file patterns to exclude from the archive, default None
force – A toggle for replacing the output file if it's already present, default True
md5_label – A string for the checksum column, default md5
size_lavel – A string for the file size column, default size
path_label – A string for the file path column, default file_path
- Returns
None
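A directory manifest of this shape can be sketched with os.walk, hashlib and csv; the helper name and exact column order are assumptions, though the labels mirror the documented defaults:

```python
import csv
import hashlib
import os

def create_manifest_sketch(results_dirpath, output_file):
    # Walk the directory and record path, md5 and size per file;
    # column labels mirror the documented defaults
    with open(output_file, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=['file_path', 'md5', 'size'])
        writer.writeheader()
        for root, _dirs, files in os.walk(results_dirpath):
            for name in sorted(files):
                path = os.path.join(root, name)
                with open(path, 'rb') as fh:
                    md5 = hashlib.md5(fh.read()).hexdigest()
                writer.writerow({
                    'file_path': os.path.relpath(path, results_dirpath),
                    'md5': md5,
                    'size': os.path.getsize(path)})
```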
-
igf_data.utils.fileutils.
get_datestamp_label
(datetime_str=None)¶ A method for fetching datestamp
- Parameters
datetime_str – A datetime string to parse, default None
- Returns
A padded string of format YYYYMMDD
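A YYYYMMDD datestamp label can be produced as in this sketch; the helper name and the accepted input format are assumptions:

```python
from datetime import datetime

def get_datestamp_label_sketch(datetime_str=None):
    # Parse an ISO-like datetime string, or use the current time,
    # and return a zero-padded YYYYMMDD label
    dt = datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S') \
        if datetime_str else datetime.now()
    return dt.strftime('%Y%m%d')
```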
-
igf_data.utils.fileutils.
get_file_extension
(input_file)¶ A method for extracting file suffix information
- Parameters
input_file – A filepath for getting suffix
- Returns
A suffix string or an empty string if no suffix found
-
igf_data.utils.fileutils.
get_temp_dir
(work_dir=None, prefix='temp', use_ephemeral_space=False)¶ A function for creating temp directory
- Parameters
work_dir – A path for work directory, default None
prefix – A prefix for directory path, default ‘temp’
use_ephemeral_space – Use env variable $EPHEMERAL to get work directory, default False
- Returns
A temp_dir
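Temp directory creation along these lines can be sketched with tempfile.mkdtemp; the helper name is hypothetical, and the $EPHEMERAL handling follows the documented toggle:

```python
import os
import tempfile

def get_temp_dir_sketch(work_dir=None, prefix='temp', use_ephemeral_space=False):
    # Optionally redirect the base dir to $EPHEMERAL, as the
    # documented toggle describes
    if use_ephemeral_space and os.environ.get('EPHEMERAL'):
        work_dir = os.environ['EPHEMERAL']
    return tempfile.mkdtemp(prefix=prefix, dir=work_dir)
```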
-
igf_data.utils.fileutils.
list_remote_file_or_dirs
(remote_server, remote_path, only_dirs=True, only_files=False, user_name=None, user_pass=None)¶ A method for listing dirs or files on the remote dir paths
- Parameters
remote_server – Remote server address
remote_path – Path on remote server
only_dirs – Toggle for listing only dirs, default True
only_files – Toggle for listing only files, default False
user_name – User name, default None
user_pass – User pass, default None
- Returns
A list of dir or file paths
-
igf_data.utils.fileutils.
move_file
(source_path, destinationa_path, force=False)¶ A method for moving files on local disk
- Parameters
source_path – A source file path
destinationa_path – A destination file path, including the file name
force – Optional, set True to overwrite existing destination file, default is False
-
igf_data.utils.fileutils.
prepare_file_archive
(results_dirpath, output_file, gzip_output=True, exclude_list=None, force=True)¶ A method for creating tar.gz archive with the files present in filepath
- Parameters
results_dirpath – A file path for input file directory
output_file – Name of the output archive filepath
gzip_output – A toggle for creating gzip output tarfile, default True
exclude_list – A list of file pattern to exclude from the archive, default None
force – A toggle for replacing output file, if its already present, default True
- Returns
None
-
igf_data.utils.fileutils.
preprocess_path_name
(input_path)¶ A method for processing a filepath. It takes a file path or dirpath and returns the same path after removing any whitespace or ascii symbols from the input.
- Parameters
input_path – An input file path or directory path
- Returns
A reformatted filepath or dirpath
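Whitespace and symbol clean-up of a path can be sketched with a regular expression applied per path component; the helper name and the exact character set kept are assumptions:

```python
import os
import re

def preprocess_path_name_sketch(input_path):
    # Replace whitespace and non-alphanumeric symbols in each path
    # component with underscores; dots and dashes are kept
    parts = []
    for part in input_path.split(os.sep):
        parts.append(re.sub(r'[^\w.\-]+', '_', part))
    return os.sep.join(parts)
```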
-
igf_data.utils.fileutils.
remove_dir
(dir_path, ignore_errors=True)¶ A function for removing directory containing files
- Parameters
dir_path – A directory path
ignore_errors – Ignore errors while removing dir, default True
Load files to irods server¶
-
class
igf_data.utils.igf_irods_client.
IGF_irods_uploader
(irods_exe_dir, host='eliot.med.ic.ac.uk', zone='/igfZone', port=1247, igf_user='igf', irods_resource='woolfResc')¶ A simple wrapper for uploading files to the irods server from the HPC cluster CX1. Before running this module, add the irods settings to ~/.irods/irods_environment.json, then run > module load irods/4.2.0 followed by > iinit (optional username) to authenticate with your password. The iinit command generates a file containing your iRODS password in a ‘scrambled form’.
- Parameters
irods_exe_dir – A path to the bin directory where icommands are installed
-
upload_analysis_results_and_create_collection
(file_list, irods_user, project_name, analysis_name='default', dir_path_list=None, file_tag=None)¶ A method for uploading analysis files to irods server
- Parameters
file_list – A list of file paths to upload to irods
irods_user – Irods user name
project_name – Name of the project
analysis_name – A string for analysis name, default is ‘default’
dir_path_list – A list of directory structure for the irods server, default None for using datestamp
file_tag – A text string for adding a tag to the collection, default None for only project_name
-
upload_fastqfile_and_create_collection
(filepath, irods_user, project_name, run_igf_id, run_date, flowcell_id=None, data_type='fastq')¶ A method for uploading files to irods server and creating collections with metadata
- Parameters
filepath – A file for upload to iRODS server
irods_user – Recipient user’s irods username
project_name – Name of the project. This will be used for the collection tag
run_igf_id – A unique igf id, either seqrun or run or experiment
run_date – A unique run date
data_type – A directory label, e.g, fastq, bam or cram
Calculate storage statistics¶
-
igf_data.utils.disk_usage_utils.
get_storage_stats_in_gb
(storage_list)¶ A utility function for fetching disk usage stats (df -h) and returning disk usage in GB
- Parameters
storage_list – An input list of storage paths
- Returns
A list of dictionaries containing the following keys
storage, used and available
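A similar result can be sketched with shutil.disk_usage instead of parsing df -h output; the helper name is hypothetical:

```python
import shutil

def get_storage_stats_in_gb_sketch(storage_list):
    # Use shutil.disk_usage rather than parsing `df -h` output;
    # values are converted to GB
    gb = 1024 ** 3
    stats = []
    for path in storage_list:
        usage = shutil.disk_usage(path)
        stats.append({
            'storage': path,
            'used': round(usage.used / gb, 2),
            'available': round(usage.free / gb, 2)})
    return stats
```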
-
igf_data.utils.disk_usage_utils.
get_sub_directory_size_in_gb
(input_path, dir_name_col='directory_name', dir_size_col='directory_size')¶ A utility function for listing disk size of all sub-directories for a given path (similar to linux command du -sh /path/* )
- Parameters
input_path – An input file path
dir_name_col – Column name for directory name, default directory_name
dir_size_col – Column name for directory size, default directory_size
- Returns
A list of dictionaries containing the keys directory_name and directory_size
A description dictionary for gviz_api
A column order list for gviz_api
-
igf_data.utils.disk_usage_utils.
merge_storage_stats_json
(config_file, label_file=None, server_name_col='server_name', storage_col='storage', used_col='used', available_col='available', disk_usage_col='disk_usage')¶ A utility function for merging multiple disk usage stats file generated by json dump of get_storage_stats_in_gb output
- Parameters
config_file –
A disk usage status config json file with the following keys
server_name and disk_usage
Each of the disk usage json files should have the following keys
storage, used and available
label_file – An optional json file for renaming the raw disk names, format: <raw name> : <print name>
- Returns
merged data as a list of dictionaries
a dictionary containing the description for the gviz_data
a list of column order
Run analysis tools¶
Process fastqc output file¶
-
igf_data.utils.fastqc_utils.
get_fastq_info_from_fastq_zip
(fastqc_zip, fastqc_datafile='*/fastqc_data.txt')¶ A function for retrieving total reads and fastq file name from a fastqc_zip file
- Parameters
fastqc_zip – A zip file containing fastqc results
fastqc_datafile – A pattern for locating the fastqc data file within the zip, default */fastqc_data.txt
- Returns
Total read count and fastq filename
Cellranger count utils¶
-
igf_data.utils.tools.cellranger.cellranger_count_utils.
check_cellranger_count_output
(output_path, file_list=('web_summary.html', 'metrics_summary.csv', 'possorted_genome_bam.bam', 'possorted_genome_bam.bam.bai', 'filtered_feature_bc_matrix.h5', 'raw_feature_bc_matrix.h5', 'molecule_info.h5', 'cloupe.cloupe', 'analysis/tsne/2_components/projection.csv', 'analysis/clustering/graphclust/clusters.csv', 'analysis/diffexp/kmeans_3_clusters/differential_expression.csv', 'analysis/pca/10_components/variance.csv'))¶ A function for checking cellranger count output
- Parameters
output_path – A filepath for cellranger count output directory
file_list –
List of files to check in the output directory, default list
web_summary.html
metrics_summary.csv
possorted_genome_bam.bam
possorted_genome_bam.bam.bai
filtered_feature_bc_matrix.h5
raw_feature_bc_matrix.h5
molecule_info.h5
cloupe.cloupe
analysis/tsne/2_components/projection.csv
analysis/clustering/graphclust/clusters.csv
analysis/diffexp/kmeans_3_clusters/differential_expression.csv
analysis/pca/10_components/variance.csv
- Returns
None
- Raises
IOError – when any file is missing from the output path
-
igf_data.utils.tools.cellranger.cellranger_count_utils.
extract_cellranger_count_metrics_summary
(cellranger_tar, collection_name=None, collection_type=None, attribute_name='attribute_name', attribute_value='attribute_value', attribute_prefix='None', target_filename='metrics_summary.csv')¶ A function for extracting the metrics summary file from a cellranger output tar and parsing the file. Optionally it can add the collection name and type info to the output dictionary.
- Parameters
cellranger_tar – A cellranger output tar file
target_filename – A filename for metrics summary file lookup, default metrics_summary.csv
collection_name – Optional collection name, default None
collection_type – Optional collection type, default None
attribute_prefix – An optional string to add as a prefix to the attribute names, default None
- Returns
A dictionary containing the metrics values
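Extracting and parsing a metrics CSV from a tar archive can be sketched with the tarfile and csv modules; the helper name is hypothetical, and the single-data-row assumption mirrors the cellranger metrics_summary.csv layout:

```python
import csv
import io
import tarfile

def extract_metrics_sketch(cellranger_tar, target_filename='metrics_summary.csv'):
    # Scan the tar for the metrics summary file and parse its
    # header row plus single data row into a dict
    with tarfile.open(cellranger_tar, 'r:*') as tar:
        for member in tar.getmembers():
            if member.name.endswith(target_filename):
                data = tar.extractfile(member).read().decode('utf-8')
                rows = list(csv.reader(io.StringIO(data)))
                return dict(zip(rows[0], rows[1]))
    return {}
```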
-
igf_data.utils.tools.cellranger.cellranger_count_utils.
get_cellranger_count_input_list
(db_session_class, experiment_igf_id, fastq_collection_type='demultiplexed_fastq', active_status='ACTIVE')¶ A function for fetching input list for cellranger count run for a specific experiment
- Parameters
db_session_class – A database session class
experiment_igf_id – An experiment igf id
fastq_collection_type – Fastq collection type name, default demultiplexed_fastq
active_status – text label for active runs, default ACTIVE
- Returns
A list of fastq dir path for the cellranger count run
- Raises
ValueError – It raises ValueError if no fastq directory found
BWA utils¶
-
class
igf_data.utils.tools.bwa_utils.
BWA_util
(bwa_exe, samtools_exe, ref_genome, input_fastq_list, output_dir, output_prefix, bam_output=True, thread=1, use_ephemeral_space=0)¶ Pipeline utils class for running BWA
- Parameters
bwa_exe – BWA executable path
samtools_exe – Samtools executable path
ref_genome – Reference genome index for BWA run
input_fastq_list – List of input fastq files for alignment
output_dir – Output directory path
output_prefix – Output prefix for alignment
bam_output – A toggle for writing bam output, default True
thread – No. of threads for BWA run, default 1
use_ephemeral_space – A toggle for temp dir settings, default 0
-
run_mem
(mem_cmd='mem', parameter_options=('-M', ''), samtools_cmd='view', dry_run=False)¶ A method for running bwa mem and generating the output alignment
- Parameters
mem_cmd – Bwa mem command, default mem
parameter_options – List of bwa mem options, default -M
samtools_cmd – Samtools view command, default view
dry_run – A toggle for returning the bwa cmd without running it, default False
- Returns
An alignment file path and the bwa run cmd
Picard utils¶
-
class
igf_data.utils.tools.picard_util.
Picard_tools
(java_exe, picard_jar, input_files, output_dir, ref_fasta, picard_option=None, java_param='-Xmx4g', strand_info='NONE', threads=1, output_prefix=None, use_ephemeral_space=0, ref_flat_file=None, ribisomal_interval=None, patterned_flowcell=False, suported_commands=('CollectAlignmentSummaryMetrics', 'CollectGcBiasMetrics', 'QualityScoreDistribution', 'CollectRnaSeqMetrics', 'CollectBaseDistributionByCycle', 'MarkDuplicates', 'AddOrReplaceReadGroups'))¶ A class for running picard tool
- Parameters
java_exe – Java executable path
picard_jar – Picard path
input_files – Input bam filepaths list
output_dir – Output directory filepath
ref_fasta – Input reference fasta filepath
picard_option – Additional picard run parameters as dictionary, default None
java_param – Java parameter, default ‘-Xmx4g’
strand_info – RNA-Seq strand information, default NONE
ref_flat_file – Input ref_flat file path, default None
output_prefix – Output prefix name, default None
threads – Number of threads to run for java, default 1
use_ephemeral_space – A toggle for temp dir setting, default 0
patterned_flowcell – Toggle for marking the patterned flowcell, default False
suported_commands –
A list of supported picard commands
CollectAlignmentSummaryMetrics
CollectGcBiasMetrics
QualityScoreDistribution
CollectRnaSeqMetrics
CollectBaseDistributionByCycle
MarkDuplicates
AddOrReplaceReadGroups
-
run_picard_command
(command_name, dry_run=False)¶ A method for running generic picard command
- Parameters
command_name – Picard command name
dry_run – A toggle for returning picard command without the actual run, default False
- Returns
A list of output files from the picard run, the picard run command and optional picard metrics
Fastp utils¶
-
class
igf_data.utils.tools.fastp_utils.
Fastp_utils
(fastp_exe, input_fastq_list, output_dir, run_thread=1, enable_polyg_trim=False, split_by_lines_count=5000000, log_output_prefix=None, use_ephemeral_space=0, fastp_options_list=('-a', 'auto', '--qualified_quality_phred=15', '--length_required=15'))¶ A class for running fastp tool for a list of input fastq files
- Parameters
fastp_exe – A fastp executable path
input_fastq_list – A list of input files
output_dir – A output directory path
split_by_lines_count – Number of entries for splitted fastq files, default 5000000
run_thread – Number of threads to use, default 1
enable_polyg_trim – Enable poly G trim for NextSeq and NovaSeq, default False
log_output_prefix – Output prefix for log file, default None
use_ephemeral_space – A toggle for temp dir, default 0
fastp_options_list – A list of options for running fastp, default ('-a', 'auto', '--qualified_quality_phred=15', '--length_required=15')
-
run_adapter_trimming
(split_fastq=False, force_overwrite=True)¶ A method for running fastp adapter trimming
- Parameters
split_fastq – Split fastq output files by line counts, default False
force_overwrite – A toggle for overwriting existing files, default True
- Returns
A list of read1 files, a list of read2 files, an html report file and the fastp commandline
GATK utils¶
-
class
igf_data.utils.tools.gatk_utils.
GATK_tools
(gatk_exe, ref_fasta, use_ephemeral_space=False, java_param='-XX:ParallelGCThreads=1 -Xmx4g')¶ A python class for running gatk tools
- Parameters
gatk_exe – Gatk exe path
java_param – Java parameter, default ‘-XX:ParallelGCThreads=1 -Xmx4g’
ref_fasta – Input reference fasta filepath
use_ephemeral_space – A toggle for temp dir settings, default False
-
run_AnalyzeCovariates
(before_report_file, after_report_file, output_pdf_path, force=False, dry_run=False, gatk_param_list=None)¶ A method for running GATK AnalyzeCovariates tool
- Parameters
before_report_file – A file containing bqsr output before recalibration
after_report_file – A file containing bqsr output after recalibration
output_pdf_path – An output pdf filepath
force – Overwrite output file, if force is True
dry_run – Return the GATK command if it's true, default False
gatk_param_list – List of additional params for BQSR, default None
- Returns
GATK commandline
-
run_ApplyBQSR
(bqsr_recal_file, input_bam, output_bam_path, force=False, dry_run=False, gatk_param_list=None)¶ A method for running GATK ApplyBQSR
- Parameters
input_bam – An input bam file
bqsr_recal_file – A bqsr table filepath
output_bam_path – A bam output file
force – Overwrite output file, if force is True
dry_run – Return the GATK command if it's true, default False
gatk_param_list – List of additional params for BQSR, default None
- Returns
GATK commandline
-
run_BaseRecalibrator
(input_bam, output_table, known_snp_sites=None, known_indel_sites=None, force=False, dry_run=False, gatk_param_list=None)¶ A method for running GATK BaseRecalibrator
- Parameters
input_bam – An input bam file
output_table – An output table filepath for recalibration results
known_snp_sites – Known snp sites (e.g. dbSNP vcf file), default None
known_indel_sites – Known indel sites (e.g. Mills_and_1000G_gold_standard indels vcf), default None
force – Overwrite output file, if force is True
dry_run – Return the GATK command if it's true, default False
gatk_param_list – List of additional params for BQSR, default None
- Returns
GATK commandline
-
run_HaplotypeCaller
(input_bam, output_vcf_path, dbsnp_vcf, emit_gvcf=True, force=False, dry_run=False, gatk_param_list=None)¶ A method for running GATK HaplotypeCaller
- Parameters
input_bam – An input bam file
output_vcf_path – An output vcf filepath
dbsnp_vcf – A dbsnp vcf file
emit_gvcf – A toggle for GVCF generation, default True
force – Overwrite output file, if force is True
dry_run – Return the GATK command if it's true, default False
gatk_param_list – List of additional params for BQSR, default None
- Returns
GATK commandline
RSEM utils¶
-
class
igf_data.utils.tools.rsem_utils.
RSEM_utils
(rsem_exe_dir, reference_rsem, input_bam, threads=1, memory_limit=4000, use_ephemeral_space=0)¶ A python wrapper for running RSEM tool
- Parameters
rsem_exe_dir – RSEM executable path
reference_rsem – RSEM reference transcriptome path
input_bam – Input bam file path for RSEM
threads – No. of threads for RSEM run, default 1
memory_limit – Memory usage limit for RSEM, default 4000 (~4 GB)
use_ephemeral_space – A toggle for temp dir settings, default 0
-
run_rsem_calculate_expression
(output_dir, output_prefix, paired_end=True, strandedness='reverse', options=None, force=True)¶ A method for running RSEM rsem-calculate-expression tool from alignment file
- Parameters
output_dir – A output dir path
output_prefix – A output file prefix
paired_end – A toggle for paired end data, default True
strandedness – RNA strand information, default reverse for Illumina TruSeq allowed values are none, forward and reverse
options – A dictionary for rsem run, default None
force – Overwrite existing data if force is True, default True
- Returns
RSEM commandline, output file list and logfile
Samtools utils¶
-
igf_data.utils.tools.samtools_utils.
convert_bam_to_cram
(samtools_exe, bam_file, reference_file, cram_path, threads=1, force=False, dry_run=False, use_ephemeral_space=0)¶ A function for converting bam files to cram using pysam utility
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath with / without index. Index file will be created if its missing
reference_file – Reference genome fasta filepath
cram_path – A cram output file path
threads – Number of threads to use for conversion, default 1
force – Output cram will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
use_ephemeral_space – A toggle for temp dir settings, default 0
- Returns
None
- Raises
IOError – It raises IOError if no input or reference fasta file found or output file already present and force is not True
ValueError – It raises ValueError if bam_file doesn’t have .bam extension or cram_path doesn’t have .cram extension
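The BAM-to-CRAM conversion command can be composed as in this sketch, mirroring the documented dry_run behaviour of returning the command without executing it; the helper name is hypothetical, and the flags are standard samtools view options:

```python
def build_cram_conversion_cmd_sketch(samtools_exe, bam_file, reference_file,
                                     cram_path, threads=1):
    # Compose a samtools view command for BAM-to-CRAM conversion:
    # -C writes CRAM, -T supplies the reference fasta, -@ sets threads
    return [
        samtools_exe, 'view', '-C',
        '-T', reference_file,
        '-@', str(threads),
        '-o', cram_path,
        bam_file]
```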
-
igf_data.utils.tools.samtools_utils.
filter_bam_file
(samtools_exe, input_bam, output_bam, samFlagInclude=None, reference_file=None, samFlagExclude=None, threads=1, mapq_threshold=20, cram_out=False, index_output=True, dry_run=False)¶ A function for filtering bam file using samtools view
- Parameters
samtools_exe – Samtools path
input_bam – Input bamfile path
output_bam – Output bamfile path
samFlagInclude – Sam flags to keep, default None
reference_file – Reference genome fasta filepath
samFlagExclude – Sam flags to exclude, default None
threads – Number of threads to use, default 1
mapq_threshold – Skip alignments with MAPQ smaller than this value, default 20
index_output – Index output bam, default True
cram_out – Output cram file, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Samtools command
-
igf_data.utils.tools.samtools_utils.
index_bam_or_cram
(samtools_exe, input_path, threads=1, dry_run=False)¶ A method for running samtools index
- Parameters
samtools_exe – samtools executable path
input_path – Alignment filepath
threads – Number of threads to use for conversion, default 1
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
samtools cmd list
-
igf_data.utils.tools.samtools_utils.
merge_multiple_bam
(samtools_exe, input_bam_list, output_bam_path, sorted_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, index_output=True)¶ A function for merging multiple input bams to a single output bam
- Parameters
samtools_exe – samtools executable path
input_bam_list – A file containing the list of bam filepaths
output_bam_path – A bam output filepath
sorted_by_name – Sort bam file by read_name, default False (for coordinate sorted bams)
threads – Number of threads to use for merging, default 1
force – Output bam file will be overwritten if force is True, default False
index_output – Index output bam, default True
use_ephemeral_space – A toggle for temp dir settings, default 0
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_flagstat
(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False)¶ A method for generating bam flagstat output
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath with / without index. Index file will be created if its missing
output_dir – Bam flagstat output directory path
output_prefix – Output file prefix, default None
threads – Number of threads to use for conversion, default 1
force – Output flagstat file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path and a list containing samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_idxstat
(samtools_exe, bam_file, output_dir, output_prefix=None, force=False, dry_run=False)¶ A function for running samtools index stats generation
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath with / without index. Index file will be created if its missing
output_dir – Bam idxstats output directory path
output_prefix – Output file prefix, default None
force – Output idxstats file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path and a list containing samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_stats
(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False)¶ A method for generating samtools stats output
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath with / without index. Index file will be created if its missing
output_dir – Bam stats output directory path
output_prefix – Output file prefix, default None
threads – Number of threads to use for conversion, default 1
force – Output flagstat file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path, a list containing the samtools command and a list containing the SN metrics of the report
-
igf_data.utils.tools.samtools_utils.
run_samtools_view
(samtools_exe, input_file, output_file, reference_file=None, force=True, cram_out=False, threads=1, samtools_params=None, index_output=True, dry_run=False, use_ephemeral_space=0)¶ A function for running samtools view command
- Parameters
samtools_exe – samtools executable path
input_file – An input bam filepath with / without index. Index file will be created if its missing
output_file – An output file path
reference_file – Reference genome fasta filepath, default None
force – Output file will be overwritten if force is True, default True
threads – Number of threads to use for conversion, default 1
samtools_params – List of samtools param, default None
index_output – Index output file, default True
dry_run – A toggle for returning the samtools command without actually running it, default False
use_ephemeral_space – A toggle for temp dir settings, default 0
- Returns
Samtools command as list
-
igf_data.utils.tools.samtools_utils.
run_sort_bam
(samtools_exe, input_bam_path, output_bam_path, sort_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, cram_out=False, index_output=True)¶ A function for sorting an input bam file and generating an output bam
- Parameters
samtools_exe – samtools executable path
input_bam_path – A bam filepath
output_bam_path – A bam output filepath
sort_by_name – Sort bam file by read_name, default False (for coordinate sorting)
threads – Number of threads to use for sorting, default 1
force – Output bam file will be overwritten if force is True, default False
cram_out – Output cram file, default False
index_output – Index output bam, default True
use_ephemeral_space – A toggle for temp dir settings, default 0
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
None
STAR utils¶
-
class
igf_data.utils.tools.star_utils.
Star_utils
(star_exe, input_files, genome_dir, reference_gtf, output_dir, output_prefix, threads=1, use_ephemeral_space=0)¶ A wrapper python class for running STAR alignment
- Parameters
star_exe – STAR executable path
input_files – List of input files for running alignment
genome_dir – STAR reference transcriptome path
reference_gtf – Reference GTF file for gene annotation
output_dir – Path for output alignment and results
output_prefix – File output prefix
threads – No. of threads for STAR run, default 1
use_ephemeral_space – A toggle for temp dir settings, default 0
-
generate_aligned_bams
(two_pass_mode=True, dry_run=False, star_patameters=('--outFilterMultimapNmax', 20, '--alignSJoverhangMin', 8, '--alignSJDBoverhangMin', 1, '--outFilterMismatchNmax', 999, '--outFilterMismatchNoverReadLmax', 0.04, '--alignIntronMin', 20, '--alignIntronMax', 1000000, '--alignMatesGapMax', 1000000, '--limitBAMsortRAM', 12000000000))¶ A method running star alignment
- Parameters
two_pass_mode – Run two-pass mode of star, default True
dry_run – A toggle for returning the star cmd without the actual run, default False
star_patameters – A list of star parameters, default ENCODE parameters
- Returns
A genomic_bam and a transcriptomic bam,log file, gene count file and star commandline
-
generate_rna_bigwig
(bedGraphToBigWig_path, chrom_length_file, stranded=True, dry_run=False)¶ A method for generating bigWig signal tracks from star aligned bam files
- Parameters
bedGraphToBigWig_path – bedGraphToBigWig executable path
chrom_length_file – A file containing chromosome lengths, e.g. a .fai file
stranded – Param for stranded analysis, default True
dry_run – A toggle for returning the star cmd without the actual run, default False
- Returns
A list of bigWig files and the star commandline
Subread utils¶
-
igf_data.utils.tools.subread_utils.
run_featureCounts
(featurecounts_exe, input_gtf, input_bams, output_file, thread=1, use_ephemeral_space=0, options=None)¶ A wrapper method for running featureCounts tool from subread package
- Parameters
featurecounts_exe – Path of featureCounts executable
input_gtf – Input gtf file path
input_bams – input bam files
output_file – Output filepath
thread – Thread counts, default is 1
options – featureCounts options, default None
use_ephemeral_space – A toggle for temp dir settings, default 0
- Returns
A summary file path and featureCounts command
Reference genome fetch utils¶
-
class
igf_data.utils.tools.reference_genome_utils.
Reference_genome_utils
(genome_tag, dbsession_class, genome_fasta_type='GENOME_FASTA', fasta_fai_type='GENOME_FAI', genome_dict_type='GENOME_DICT', gene_gtf_type='GENE_GTF', gene_reflat_type='GENE_REFFLAT', gene_rsem_type='TRANSCRIPTOME_RSEM', bwa_ref_type='GENOME_BWA', minimap2_ref_type='GENOME_MINIMAP2', bowtie2_ref_type='GENOME_BOWTIE2', tenx_ref_type='TRANSCRIPTOME_TENX', star_ref_type='TRANSCRIPTOME_STAR', genome_dbsnp_type='DBSNP_VCF', gatk_snp_ref_type='GATK_SNP_REF', gatk_indel_ref_type='INDEL_LIST_VCF', ribosomal_interval_type='RIBOSOMAL_INTERVAL', blacklist_interval_type='BLACKLIST_BED', genome_twobit_uri_type='GENOME_TWOBIT_URI')¶ A class for accessing different components of the reference genome for a specific build
-
get_blacklist_region_bed
(check_missing=False)¶ A method for fetching blacklist interval filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default False
- Returns
A filepath string
-
get_dbsnp_vcf
(check_missing=True)¶ A method for fetching filepath for dbSNP vcf file, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_gatk_indel_ref
(check_missing=True)¶ A method for fetching filepaths for INDEL files from GATK bundle, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A list of filepaths
-
get_gatk_snp_ref
(check_missing=True)¶ A method for fetching filepaths for SNP files from GATK bundle, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A list of filepaths
-
get_gene_gtf
(check_missing=True)¶ A method for fetching reference gene annotation gtf filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_gene_reflat
(check_missing=True)¶ A method for fetching reference gene annotation refflat filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_generic_ref_files
(collection_type, check_missing=True)¶ A method for fetching filepath for generic reference genome file, for a specific genome build
- Parameters
collection_type – Collection type for reference file lookup
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string or list (if more than one found)
-
get_genome_bowtie2
(check_missing=True)¶ A method for fetching filepath of Bowtie2 reference index, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_genome_bwa
(check_missing=True)¶ A method for fetching filepath of BWA reference index, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_genome_dict
(check_missing=True)¶ A method for fetching reference genome dictionary filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_genome_fasta
(check_missing=True)¶ A method for fetching reference genome fasta filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_genome_fasta_fai
(check_missing=True)¶ A method for fetching reference genome fasta fai index filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_genome_minimap2
(check_missing=True)¶ A method for fetching filepath of Minimap2 reference index, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_ribosomal_interval
(check_missing=True)¶ A method for fetching ribosomal interval filepath for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_transcriptome_rsem
(check_missing=False)¶ A method for fetching filepath of RSEM reference transcriptome, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default False
- Returns
A filepath string
-
get_transcriptome_star
(check_missing=False)¶ A method for fetching filepath of STAR reference transcriptome, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default False
- Returns
A filepath string
-
get_transcriptome_tenx
(check_missing=True)¶ A method for fetching filepath of 10X Cellranger reference transcriptome, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A filepath string
-
get_twobit_genome_url
(check_missing=True)¶ A method for fetching filepath for twobit genome url, for a specific genome build
- Parameters
check_missing – A toggle for checking errors for missing files, default True
- Returns
A url string
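All of the getters above follow the same lookup pattern: a collection type string resolves to a filepath, and `check_missing` controls whether a missing entry raises an error. A minimal standalone sketch of that pattern (the function name and dict-based store are illustrative, not the class's actual internals):

```python
# Illustrative sketch of the collection-type lookup pattern shared by
# the Reference_genome_utils getters; a dict stands in for the database.
def get_ref_file(ref_files, collection_type, check_missing=True):
    path = ref_files.get(collection_type)
    if path is None and check_missing:
        raise ValueError(
            'No reference file found for type {0}'.format(collection_type))
    return path
```

In the real class the lookup goes through the database session rather than a dict, but the error semantics documented above are the same.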
-
Samtools utils¶
-
igf_data.utils.tools.samtools_utils.
convert_bam_to_cram
(samtools_exe, bam_file, reference_file, cram_path, threads=1, force=False, dry_run=False, use_ephemeral_space=0) A function for converting bam files to cram using pysam utility
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath, with or without index; the index file will be created if it is missing
reference_file – Reference genome fasta filepath
cram_path – A cram output file path
threads – Number of threads to use for conversion, default 1
force – Output cram will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
use_ephemeral_space – A toggle for temp dir settings, default 0
- Returns
None
- Raises
IOError – It raises IOError if the input or reference fasta file is not found, or if the output file is already present and force is not True
ValueError – It raises ValueError if bam_file doesn’t have .bam extension or cram_path doesn’t have .cram extension
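The extension checks and the conversion step can be sketched as below. This is a standalone illustration with a hypothetical function name; the real function may drive pysam rather than building a command, but `samtools view -C -T <ref>` is the standard CLI form of the conversion.

```python
# Illustrative sketch of bam-to-cram conversion: extension checks mirror
# the documented ValueError behaviour, then a samtools view -C command.
def build_cram_cmd(samtools_exe, bam_file, reference_file, cram_path, threads=1):
    if not bam_file.endswith('.bam'):
        raise ValueError('Expecting a .bam input: {0}'.format(bam_file))
    if not cram_path.endswith('.cram'):
        raise ValueError('Expecting a .cram output: {0}'.format(cram_path))
    # samtools view -C writes CRAM; -T supplies the reference fasta
    return [samtools_exe, 'view', '-C', '-T', reference_file,
            '-@', str(threads), '-o', cram_path, bam_file]
```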
-
igf_data.utils.tools.samtools_utils.
filter_bam_file
(samtools_exe, input_bam, output_bam, samFlagInclude=None, reference_file=None, samFlagExclude=None, threads=1, mapq_threshold=20, cram_out=False, index_output=True, dry_run=False) A function for filtering bam file using samtools view
- Parameters
samtools_exe – Samtools path
input_bam – Input bamfile path
output_bam – Output bamfile path
samFlagInclude – Sam flags to keep, default None
reference_file – Reference genome fasta filepath
samFlagExclude – Sam flags to exclude, default None
threads – Number of threads to use, default 1
mapq_threshold – Skip alignments with MAPQ smaller than this value, default 20
index_output – Index output bam, default True
cram_out – Output cram file, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Samtools command
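The documented parameters map directly onto samtools view flags; a standalone sketch of that mapping (illustrative, not the package's code) assuming the standard `-f`/`-F`/`-q` flag meanings:

```python
# Illustrative sketch of how the filter parameters translate into
# samtools view flags: -f include, -F exclude, -q MAPQ cutoff.
def build_filter_cmd(samtools_exe, input_bam, output_bam, samFlagInclude=None,
                     samFlagExclude=None, mapq_threshold=20, threads=1):
    cmd = [samtools_exe, 'view', '-b', '-@', str(threads)]
    if samFlagInclude is not None:
        cmd += ['-f', str(samFlagInclude)]   # keep reads with these flags
    if samFlagExclude is not None:
        cmd += ['-F', str(samFlagExclude)]   # drop reads with these flags
    if mapq_threshold is not None:
        cmd += ['-q', str(mapq_threshold)]   # skip alignments below MAPQ
    return cmd + ['-o', output_bam, input_bam]
```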
-
igf_data.utils.tools.samtools_utils.
index_bam_or_cram
(samtools_exe, input_path, threads=1, dry_run=False) A method for running samtools index
- Parameters
samtools_exe – samtools executable path
input_path – Alignment filepath
threads – Number of threads to use for conversion, default 1
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
samtools cmd list
-
igf_data.utils.tools.samtools_utils.
merge_multiple_bam
(samtools_exe, input_bam_list, output_bam_path, sorted_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, index_output=True) A function for merging multiple input bams to a single output bam
- Parameters
samtools_exe – samtools executable path
input_bam_list – A file containing list of bam filepath
output_bam_path – A bam output filepath
sorted_by_name – Sort bam file by read_name, default False (for coordinate sorted bams)
threads – Number of threads to use for merging, default 1
force – Output bam file will be overwritten if force is True, default False
index_output – Index output bam, default True
use_ephemeral_space – A toggle for temp dir settings, default 0
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_flagstat
(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False) A method for generating bam flagstat output
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath, with or without index; the index file will be created if it is missing
output_dir – Bam flagstat output directory path
output_prefix – Output file prefix, default None
threads – Number of threads to use for conversion, default 1
force – Output flagstat file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path and a list containing samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_idxstat
(samtools_exe, bam_file, output_dir, output_prefix=None, force=False, dry_run=False) A function for running samtools index stats generation
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath, with or without index; the index file will be created if it is missing
output_dir – Bam idxstats output directory path
output_prefix – Output file prefix, default None
force – Output idxstats file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path and a list containing samtools command
-
igf_data.utils.tools.samtools_utils.
run_bam_stats
(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False) A method for generating samtools stats output
- Parameters
samtools_exe – samtools executable path
bam_file – A bam filepath, with or without index; the index file will be created if it is missing
output_dir – Bam stats output directory path
output_prefix – Output file prefix, default None
threads – Number of threads to use for conversion, default 1
force – Output flagstat file will be overwritten if force is True, default False
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
Output file path, a list containing the samtools command and a list containing the SN metrics from the report
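The SN metrics come from the `SN` summary section of samtools stats output, which uses tab-separated lines of the form `SN<TAB>metric name:<TAB>value`. A minimal standalone parser for that section might look like this (illustrative, not the package's implementation):

```python
# Illustrative parser for the 'SN' summary-number lines that
# samtools stats writes, e.g. 'SN\traw total sequences:\t1000'.
def parse_sn_metrics(stats_text):
    metrics = {}
    for line in stats_text.splitlines():
        if line.startswith('SN\t'):
            fields = line.split('\t')
            # fields: ['SN', 'metric name:', 'value', optional comment]
            metrics[fields[1].rstrip(':')] = fields[2]
    return metrics
```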
-
igf_data.utils.tools.samtools_utils.
run_samtools_view
(samtools_exe, input_file, output_file, reference_file=None, force=True, cram_out=False, threads=1, samtools_params=None, index_output=True, dry_run=False, use_ephemeral_space=0) A function for running samtools view command
- Parameters
samtools_exe – samtools executable path
input_file – An input bam filepath, with or without index; the index file will be created if it is missing
output_file – An output file path
reference_file – Reference genome fasta filepath, default None
force – Output file will be overwritten if force is True, default True
threads – Number of threads to use for conversion, default 1
samtools_params – List of samtools param, default None
index_output – Index output file, default True
dry_run – A toggle for returning the samtools command without actually running it, default False
use_ephemeral_space – A toggle for temp dir settings, default 0
- Returns
Samtools command as list
-
igf_data.utils.tools.samtools_utils.
run_sort_bam
(samtools_exe, input_bam_path, output_bam_path, sort_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, cram_out=False, index_output=True) A function for sorting an input bam file and generating an output bam
- Parameters
samtools_exe – samtools executable path
input_bam_path – A bam filepath
output_bam_path – A bam output filepath
sort_by_name – Sort bam file by read_name, default False (for coordinate sorting)
threads – Number of threads to use for sorting, default 1
force – Output bam file will be overwritten if force is True, default False
cram_out – Output cram file, default False
index_output – Index output bam, default True
use_ephemeral_space – A toggle for temp dir settings, default 0
dry_run – A toggle for returning the samtools command without actually running it, default False
- Returns
None
Scanpy utils¶
Metadata processing¶
Register metadata for new projects¶
-
class
igf_data.process.seqrun_processing.find_and_register_new_project_data.
Find_and_register_new_project_data
(projet_info_path, dbconfig, user_account_template, log_slack=True, slack_config=None, check_hpc_user=False, hpc_user=None, hpc_address=None, ldap_server=None, setup_irods=True, notify_user=True, default_user_email='igf@imperial.ac.uk', project_lookup_column='project_igf_id', user_lookup_column='email_id', data_authority_column='data_authority', sample_lookup_column='sample_igf_id', barcode_check_keyword='barcode_check', metadata_sheet_name='Project metadata', sendmail_exe='/usr/sbin/sendmail')¶ A class for finding new data for projects and registering them to the db. Accounts for new users will be created in the irods server and passwords will be mailed to them.
- Parameters
projet_info_path – A directory path for project info files
dbconfig – A json dbconfig file
check_hpc_user – Guess the hpc user name, True or False, default: False
hpc_user – A hpc user name, default is None
hpc_address – A hpc host address, default is None
ldap_server – A ldap server address for search, default is None
user_account_template – A template file for user account activation email
log_slack – Enable or disable sending message to slack, default: True
slack_config – A slack config json file, required if log_slack is True
project_lookup_column – project data lookup column, default project_igf_id
user_lookup_column – user data lookup column, default email_id
sample_lookup_column – sample data lookup column, default sample_igf_id
data_authority_column – data authority column name, default data_authority
setup_irods – Setup irods account for user, default is True
notify_user – Send email notification to user, default is True
default_user_email – Add another user as the default collaborator for all new projects, default igf@imperial.ac.uk
barcode_check_keyword – Project attribute name for barcode check settings, default barcode_check
sendmail_exe – Sendmail executable path, default /usr/sbin/sendmail
-
process_project_data_and_account
()¶ A method for finding new project info, registering it to the database, and creating user accounts
Update experiment metadata from sample attributes¶
-
class
igf_data.process.metadata.experiment_metadata_updator.
Experiment_metadata_updator
(dbconfig_file, log_slack=True, slack_config=None)¶ A class for updating metadata for experiment table in database
-
update_metadta_from_sample_attribute
(experiment_igf_id=None, sample_attribute_names=('library_source', 'library_strategy', 'experiment_type'))¶ A method for fetching experiment metadata from sample_attribute tables
- Parameters
experiment_igf_id – An experiment igf id for updating only a selected experiment, default None for all experiments
sample_attribute_names – A list of sample attribute names to look for experiment metadata, default: library_source, library_strategy, experiment_type
-
Sequencing run¶
Process samplesheet file¶
-
class
igf_data.illumina.samplesheet.
SampleSheet
(infile, data_header_name='Data')¶ A class for processing SampleSheet files for Illumina sequencing runs
- Parameters
infile – A samplesheet file
data_header_name – name of the data section, default Data
-
add_pseudo_lane_for_miseq
(lane='1')¶ A method for adding pseudo lane information for the MiSeq platform
- Parameters
lane – A lane id for pseudo lane value
-
add_pseudo_lane_for_nextseq
(lanes=('1', '2', '3', '4'))¶ A method for adding pseudo lane information for the nextseq platform
- Parameters
lanes – A list of pseudo lanes, default [‘1’,’2’,’3’,’4’]
- Returns
None
-
check_sample_header
(section, condition_key)¶ Function for checking SampleSheet header
- Parameters
section – A field name for header info check
condition_key – A condition key for header info check
- Returns
Zero if it is not present, or the number of occurrences of the term
-
filter_sample_data
(condition_key, condition_value, method='include', lane_header='Lane', lane_default_val='1')¶ Function for filtering SampleSheet data based on matching condition
- Parameters
condition_key – A samplesheet column name
condition_value – A keyword present in the selected column
method – ‘include’ or ‘exclude’ for adding or removing the selected column from the samplesheet, default is include
-
get_index_count
()¶ A function for getting index length counts
- Returns
A dictionary, with the index columns as the key
-
get_indexes
()¶ A method for retrieving the indexes from the samplesheet
- Returns
A list of index barcodes
-
get_lane_count
(lane_field='Lane', target_platform='HiSeq')¶ Function for getting the lane information for HiSeq runs. It will return 1 for both MiSeq and NextSeq runs
- Parameters
lane_field – Column name for lane info, default ‘Lane’
target_platform – Hiseq platform tag, default ‘HiSeq’
- Returns
A list of lanes present in samplesheet file
-
get_platform_name
(section='Header', field='Application')¶ Function for getting platform details from samplesheet header
- Parameters
section – File section for lookup, default ‘Header’
field – Field name for platform info, default ‘Application’
-
get_project_and_lane
(project_tag='Sample_Project', lane_tag='Lane')¶ A method for fetching project and lane information from samplesheet
- Parameters
project_tag – A string for project name column in the samplesheet, default Sample_Project
lane_tag – A string for Lane id column in the samplesheet, default Lane
- Returns
A list of project name (for all) and lane information (only for hiseq)
-
get_project_names
(tag='sample_project')¶ Function for retrieving unique project names from samplesheet. If there are multiple matching headers, the first column will be used
- Parameters
tag – Name of tag for project lookup, default sample_project
- Returns
A list of unique project name
-
get_reverse_complement_index
(index_field='index2')¶ A function for changing the I5_index present in the index2 field of the samplesheet to its reverse complement sequence
- Parameters
index_field – Column name for index 2, default index2
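The I5 flip described above is a plain reverse complement of the barcode sequence. A minimal standalone sketch (the function name is illustrative):

```python
# Illustrative reverse complement of an index barcode, as applied to
# the I5/index2 column for platforms that read it in the opposite
# orientation; N bases are preserved.
def reverse_complement(index_seq):
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
    return ''.join(comp[base] for base in reversed(index_seq.upper()))
```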
-
group_data_by_index_length
()¶ Function for grouping samplesheet rows based on the combined length of index columns. By default, this function removes Ns from the index
- Returns
A dictionary of samplesheet objects, with combined index length as the key
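The grouping rule can be sketched as below: each row is keyed by the combined length of its index columns after Ns are dropped. This is a standalone illustration over plain dicts (column names assumed), not the class's actual implementation, which returns samplesheet objects.

```python
# Illustrative sketch of grouping rows by combined index length,
# removing Ns first, as the method above describes.
def group_by_index_length(rows, index_cols=('index', 'index2')):
    groups = {}
    for row in rows:
        length = sum(
            len(row.get(col, '').replace('N', '')) for col in index_cols)
        groups.setdefault(length, []).append(row)
    return groups
```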
-
modify_sample_header
(section, type, condition_key, condition_value='')¶ Function for modifying SampleSheet header
- Parameters
section – A field name for header info check
condition_key – A condition key for header info check
type – Mode type, ‘add’ or ‘remove’
condition_value – It is required for the ‘add’ type
-
print_sampleSheet
(outfile)¶ Function for printing output SampleSheet
- Parameters
outfile – An output samplesheet path
-
validate_samplesheet_data
(schema_json)¶ A method for validation of samplesheet data
- Parameters
schema_json – A JSON schema for validation of the samplesheet data
- Returns
A list of error messages, or an empty list if no error is found
Fetch read cycle info from RunInfo.xml file¶
-
class
igf_data.illumina.runinfo_xml.
RunInfo_xml
(xml_file)¶ A class for reading runinfo xml file from illumina sequencing runs
- Parameters
xml_file – A runinfo xml file
-
get_flowcell_name
()¶ A method for accessing the flowcell name from the runinfo xml file
-
get_platform_number
()¶ Function for fetching the instrument series number
-
get_reads_stats
(root_tag='read', number_tag='number', tags=('isindexedread', 'numcycles'))¶ A method for getting read and index stats from the RunInfo.xml file
- Parameters
root_tag – Root tag for xml file, default read
number_tag – Number tag for xml file, default number
tags – List of tags for xml lookup, default [‘isindexedread’,’numcycles’]
- Returns
A dictionary with the read number as the key
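RunInfo.xml lists one `<Read>` element per read segment, each carrying `Number`, `NumCycles` and `IsIndexedRead` attributes. A minimal standalone parse keyed by read number might look like this (a sketch using the standard library, not the class's code; the example XML is a trimmed-down assumption of the Illumina format):

```python
import xml.etree.ElementTree as ET

# Illustrative parse of the <Read> elements in a RunInfo.xml,
# keyed by read number with lower-cased attribute names.
def read_stats(xml_text):
    stats = {}
    for read in ET.fromstring(xml_text).iter('Read'):
        attribs = {key.lower(): value for key, value in read.attrib.items()}
        stats[attribs['number']] = attribs
    return stats

# A trimmed-down example of the Illumina format (assumed structure)
example = ('<RunInfo><Run><Reads>'
           '<Read Number="1" NumCycles="151" IsIndexedRead="N"/>'
           '<Read Number="2" NumCycles="8" IsIndexedRead="Y"/>'
           '</Reads></Run></RunInfo>')
```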
Fetch flowcell info from runparameters xml file¶
-
class
igf_data.illumina.runparameters_xml.
RunParameter_xml
(xml_file)¶ A class for reading runparameters xml file from Illumina sequencing runs
- Parameters
xml_file – A runparameters xml file
-
get_hiseq_flowcell
()¶ A method for fetching flowcell details for hiseq run
- Returns
Flowcell info or None (for MiSeq and NextSeq runs)
Find and process new sequencing run for demultiplexing¶
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
calculate_file_md5
(seqrun_info, md5_out, seqrun_path, file_suffix='md5.json', exclude_dir=())¶ A method for file md5 calculation for all the sequencing run files
- Parameters
seqrun_info – A dictionary containing sequencing run information
md5_out – JSON md5 file output directory
seqrun_path – A directory path for sequencing run look up
file_suffix – Suffix information for new JSON md5 files, default: md5.json
exclude_dir – A list of directories to exclude from the file look up
- Returns
A dictionary of JSON output files, {seqrun_name: seqrun_md5_list_path}
Format of the json file: [{“seqrun_file_name”: “file_path”, “file_md5”: “md5_value”}]
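The documented JSON md5 format can be produced with the standard library alone; a sketch (the helper name is illustrative, not part of the package):

```python
import hashlib
import json

# Illustrative construction of one entry in the documented format,
# [{"seqrun_file_name": "file_path", "file_md5": "md5_value"}].
def md5_entry(file_name, data):
    return {'seqrun_file_name': file_name,
            'file_md5': hashlib.md5(data).hexdigest()}

md5_list = json.dumps([md5_entry('RunInfo.xml', b'<RunInfo/>')])
```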
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
check_finished_seqrun_dir
(seqrun_dir, seqrun_path, required_files=('RTAComplete.txt', 'SampleSheet.csv', 'RunInfo.xml'))¶ A method for checking complete sequencing run directory
- Parameters
seqrun_dir – A list of sequencing run names
seqrun_path – A directory path for new sequencing run look up
required_files – A list of files to check before marking sequencing run as complete, default: ‘RTAComplete.txt’,’SampleSheet.csv’,’RunInfo.xml’
- Returns
A dictionary containing valid sequencing run information
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
check_for_registered_project_and_sample
(seqrun_info, dbconfig, samplesheet_file='SampleSheet.csv')¶ A method for fetching project and sample records from samplesheet and checking for registered samples in db
- Parameters
seqrun_info – A dictionary containing seqrun name and path as key and values
dbconfig – A database configuration file
samplesheet_file – Name of samplesheet file, default is SampleSheet.csv
- Returns
A dictionary containing the new run information A string message containing database checking information
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
check_seqrun_dir_in_db
(all_seqrun_dir, dbconfig)¶ A method for checking existing seqrun dirs in database
- Parameters
all_seqrun_dir – list of seqrun dirs to check
dbconfig – dbconfig
- Returns
A list containing new sequencing run information
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
find_new_seqrun_dir
(path, dbconfig)¶ A method for checking and finding new sequencing run directories
- Parameters
path – A directory path for new sequencing run lookup
dbconfig – A database configuration file
- Returns
A list of new sequencing run names for processing
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
load_seqrun_files_to_db
(seqrun_info, seqrun_md5_info, dbconfig, file_type='ILLUMINA_BCL_MD5')¶ A method for loading md5 lists to collection and files table
- Parameters
seqrun_info – A dictionary containing the sequencing run information
seqrun_md5_info – A dictionary containing the sequencing run JSON md5 file info
dbconfig – A database configuration file
file_type – A collection type information for loading the JSON files to database
- Returns
None
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
prepare_seqrun_for_db
(seqrun_name, seqrun_path, session_class)¶ A method for preparing seqrun data for database
- Parameters
seqrun_name – A sequencing run name
seqrun_path – A directory path for sequencing run look up
session_class – A database session class
- Returns
A dictionary containing information to populate the seqrun table in database
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
seed_pipeline_table_for_new_seqrun
(pipeline_name, dbconfig)¶ A method for seeding pipelines for the new seqruns
- Parameters
pipeline_name – A pipeline name
dbconfig – A dbconfig file
- Returns
None
-
igf_data.process.seqrun_processing.find_and_process_new_seqrun.
validate_samplesheet_for_seqrun
(seqrun_info, schema_json, output_dir, samplesheet_file='SampleSheet.csv')¶ A method for validating samplesheet and writing errors to a report file
- Parameters
seqrun_info – A dictionary containing seqrun name and path as key and values
schema_json – A json schema for samplesheet validation
output_dir – A directory path for writing output report files
samplesheet_file – Samplesheet filename, default ‘SampleSheet.csv’
- Returns
new_seqrun_info, a new dictionary containing seqrun name and path as key and values
error_file_list, a dictionary containing seqrun name and error file paths as key and values
Demultiplexing¶
Bases mask calculation¶
-
class
igf_data.illumina.basesMask.
BasesMask
(samplesheet_file, runinfo_file, read_offset, index_offset)¶ A class for bases mask value calculation for demultiplexing of sequencing runs
- Parameters
samplesheet_file – A samplesheet file containing sample index barcodes
runinfo_file – A runinfo xml file from sequencing run
read_offset – Read offset value in bp
index_offset – Index offset value in bp
-
calculate_bases_mask
(numcycle_label='numcycles', isindexedread_label='isindexedread')¶ A method for bases mask value calculation
- Parameters
numcycle_label – Cycle label in runinfo xml file, default numcycles
isindexedread_label – Index cycle label in runinfo xml file, default isindexedread
- Returns
A formatted bases mask value for bcl2fastq run
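The mask rule bcl2fastq expects is one `Y<cycles>` segment per data read and one `I<cycles>` segment per index read, joined by commas. A standalone sketch of that rule, ignoring the read/index offset handling the class also performs (function name and input shape are illustrative):

```python
# Illustrative bases-mask construction: 'Y' segments for data reads,
# 'I' segments for index reads, in sequencing order.
def bases_mask(reads):
    # reads: list of (numcycles, is_indexed) tuples in read order
    return ','.join(
        ('I' if is_indexed else 'Y') + str(numcycles)
        for numcycles, is_indexed in reads)
```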
Copy bcl files for demultiplexing¶
Collect demultiplexed fastq files to database¶
-
class
igf_data.process.seqrun_processing.collect_seqrun_fastq_to_db.
Collect_seqrun_fastq_to_db
(fastq_dir, model_name, seqrun_igf_id, session_class, flowcell_id, samplesheet_file=None, samplesheet_filename='SampleSheet.csv', collection_type='demultiplexed_fastq', file_location='HPC_PROJECT', collection_table='run', manifest_name='file_manifest.csv', singlecell_tag='10X')¶ A class for collecting raw fastq files after demultiplexing and storing them in the database. Additionally, this will create relevant entries for the experiment and run tables in the database
- Parameters
fastq_dir – A directory path for file look up
model_name – Sequencing platform information
seqrun_igf_id – Sequencing run name
session_class – A database session class
flowcell_id – Flowcell information for the run
samplesheet_file – Samplesheet filepath
samplesheet_filename – Name of the samplesheet file, default SampleSheet.csv
collection_type – Collection type information for new fastq files, default demultiplexed_fastq
file_location – Fastq file location information, default HPC_PROJECT
collection_table – Collection table information for fastq files, default run
manifest_name – Name of the file manifest file, default file_manifest.csv
singlecell_tag – Samplesheet description for singlecell samples, default 10X
-
find_fastq_and_build_db_collection
()¶ A method for finding fastq files and samplesheet under a run directory and loading the new files to db with their experiment and run information
It calculates the following entries:
- library_name
Same as sample_id unless mentioned in ‘Description’ field of samplesheet
- experiment_igf_id
library_name combined with the platform name; the same library sequenced on a different platform will be added as a separate experiment
- run_igf_id
experiment_igf_id combined with sequencing flowcell_id and lane_id. Collection name: same as run_igf_id; fastq files will be added to the db collection using this id
- collection type
Default type for fastq file collections are ‘demultiplexed_fastq’
- file_location
Default value is ‘HPC_PROJECT’
Check demultiplexing barcode stats¶
Pipeline control¶
Reset pipeline seeds for re-processing¶
-
class
igf_data.process.pipeline.modify_pipeline_seed.
Modify_pipeline_seed
(igf_id_list, table_name, pipeline_name, dbconfig_file, log_slack=True, log_asana=True, slack_config=None, asana_project_id=None, asana_config=None, clean_up=True)¶ A class for changing pipeline run status in the pipeline_seed table
-
reset_pipeline_seed_for_rerun
(seeded_label='SEEDED', restricted_status_list=('SEEDED', 'RUNNING'))¶ A method for setting the pipeline for re-run if the first run has failed or aborted. This method will set the pipeline_seed.status as ‘SEEDED’ only if it is not already ‘SEEDED’ or ‘RUNNING’
- Parameters
seeded_label – A text label for seeded status, default SEEDED
restricted_status_list – A list of pipeline status to exclude from the search, default [‘SEEDED’,’RUNNING’]
-
Reset samplesheet files after modification for rerunning pipeline¶
-
class
igf_data.process.seqrun_processing.reset_samplesheet_md5.
Reset_samplesheet_md5
(seqrun_path, seqrun_igf_list, dbconfig_file, clean_up=True, json_collection_type='ILLUMINA_BCL_MD5', log_slack=True, log_asana=True, slack_config=None, asana_project_id=None, asana_config=None, samplesheet_name='SampleSheet.csv')¶ A class for modifying samplesheet md5 for seqrun data processing
-
run
()¶ A method for resetting md5 values in the samplesheet json files for all seqrun ids
-
Demultiplexing of single cell sample¶
Modify samplesheet for singlecell samples¶
-
class
igf_data.process.singlecell_seqrun.processsinglecellsamplesheet.
ProcessSingleCellSamplesheet
(samplesheet_file, singlecell_barcode_json, singlecell_tag='10X', index_column='index', sample_id_column='Sample_ID', sample_name_column='Sample_Name', orig_sample_id='Original_Sample_ID', orig_sample_name='Original_Sample_Name', sample_description_column='Description', orig_index='Original_index')¶ A class for processing samplesheets containing single cell (10X) index barcodes. It requires a json format file listing all the single cell barcodes downloaded from this page: https://support.10xgenomics.com/single-cell-gene-expression/sequencing/doc/specifications-sample-index-sets-for-single-cell-3
- Parameters
samplesheet_file – A samplesheet containing single cell samples
singlecell_barcode_json – A JSON file listing single cell indexes
singlecell_tag – A text keyword for the single cell sample description
index_column – Column name for index lookup, default ‘index’
sample_id_column – Column name for sample_id lookup, default ‘Sample_ID’
sample_name_column – Column name for sample_name lookup, default ‘Sample_Name’
sample_description_column – Column name for sample description lookup, default ‘Description’
orig_sample_id – Column name for keeping original sample ids, default ‘Original_Sample_ID’
orig_sample_name – Column name for keeping original sample_names, default ‘Original_Sample_Name’
orig_index – Column name for keeping original index, default ‘Original_index’
-
change_singlecell_barcodes
(output_samplesheet)¶ A method for replacing single cell index codes present in the samplesheet with the four index sequences. This method will create 4 samplesheet entries for each of the single cell samples with _1 to _4 suffix and relevant indexes
- Parameters
output_samplesheet – A file name of the output samplesheet
-
Merge fastq files for single cell samples¶
-
class
igf_data.process.singlecell_seqrun.mergesinglecellfastq.
MergeSingleCellFastq
(fastq_dir, samplesheet, platform_name, singlecell_tag='10X', sampleid_col='Sample_ID', samplename_col='Sample_Name', use_ephemeral_space=0, orig_sampleid_col='Original_Sample_ID', description_col='Description', orig_samplename_col='Original_Sample_Name', project_col='Sample_Project', lane_col='Lane', pseudo_lane_col='PseudoLane', force_overwrite=True)¶ A class for merging single cell fastq files per lane per sample
- Parameters
fastq_dir – A directory path containing fastq files
samplesheet – A samplesheet file used demultiplexing of bcl files
platform_name – A sequencing platform name
singlecell_tag – A single cell keyword for description field, default ‘10X’
sampleid_col – A keyword for sample id column of samplesheet, default ‘Sample_ID’
samplename_col – A keyword for sample name column of samplesheet, default ‘Sample_Name’
orig_sampleid_col – A keyword for original sample id column, default ‘Original_Sample_ID’
orig_samplename_col – A keyword for original sample name column, default ‘Original_Sample_Name’
description_col – A keyword for description column, default ‘Description’
project_col – A keyword for project column, default ‘Sample_Project’
pseudo_lane_col – A keyword for pseudo lane column, default ‘PseudoLane’
lane_col – A keyword for lane column, default ‘Lane’
force_overwrite – A toggle for overwriting output fastqs, default True
- SampleSheet file should contain following columns:
Sample_ID: A single cell sample id in the following format, SampleId_{digit}
Sample_Name: A single cell sample name in the following format, SampleName_{digit}
Original_Sample_ID: An IGF sample id
Original_Sample_Name: A sample name provided by user
Description: A single cell label, default 10X
-
merge_fastq_per_lane_per_sample
()¶ A method for merging single cell fastq files present in input fastq_dir per lane per sample basis
Report page building¶
Configure Biodalliance genome browser for qc page¶
-
class
igf_data.utils.config_genome_browser.
Config_genome_browser
(dbsession_class, project_igf_id, collection_type_list, pipeline_name, collection_table, species_name, ref_genome_type, track_file_type=None, analysis_path_prefix='analysis', use_ephemeral_space=0, analysis_dir_structure_list=('sample_igf_id', ))¶ A class for configuring genome browser input files for analysis track visualization
- Parameters
dbsession_class – A database session class
project_igf_id – A project igf id
collection_type_list – A list of collection types to include in the track
pipeline_name – Name of the analysis pipeline for status checking
collection_table – Name of the file collection table
species_name – Species name for ref genome fetching
ref_genome_type – Reference genome type for remote tracks
track_file_type – Additional track file collection types
analysis_path_prefix – Top level dir name for analysis files, default ‘analysis’
use_ephemeral_space – A toggle for temp dir settings, default 0
analysis_dir_structure_list – List of keywords for sub directory paths, default [‘sample_igf_id’]
-
build_biodalliance_config
(template_file, output_file)¶ A method for building a Biodalliance-specific config file
- Parameters
template_file – A template file path
output_file – An output filepath
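The template-to-config step can be sketched with the stdlib `string.Template`. The `$SPECIES_NAME` and `$TRACK_LIST` placeholder names below are hypothetical; the template file shipped with the real pipeline may use different keys and a fuller Biodalliance source definition.

```python
from string import Template

# Hypothetical template; the real template file may use different placeholders.
TEMPLATE = Template('''
var browser = new Browser({
  coordSystem: {speciesName: '$SPECIES_NAME'},
  sources: [$TRACK_LIST]
});
''')

def build_browser_config(species_name, tracks):
    """Render a minimal Biodalliance-style config from a list of
    (track_name, remote_url) pairs."""
    track_list = ','.join(
        "{{name: '{0}', uri: '{1}'}}".format(name, uri)
        for name, uri in tracks)
    return TEMPLATE.substitute(
        SPECIES_NAME=species_name,
        TRACK_LIST=track_list)

config = build_browser_config(
    'Homo sapiens',
    [('sampleA coverage', 'https://example.com/sampleA.bw')])
```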
Process Google chart json data¶
-
igf_data.utils.gviz_utils.
convert_to_gviz_json_for_display
(description, data, columns_order, output_file=None)¶ A utility method for writing gviz format json file for data display using Google charts
- Parameters
description – A dictionary for the data table description
data – A dictionary containing the data table
columns_order – A tuple of the data table column order
output_file – Output filename, default None
- Returns
None if an output_file name is present, or else the json_data string
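The gviz json produced for Google Charts follows the DataTable JSON literal layout (`cols` with id/label/type, `rows` with lists of `{"v": ...}` cells). This stdlib-only sketch shows that target structure for the same `description`, `data` and `columns_order` inputs; the production utility may instead rely on a gviz helper package.

```python
import json

def to_gviz_json(description, data, columns_order):
    """Build a Google Charts DataTable JSON string from a column
    description map, a list of row dicts and an explicit column order."""
    cols = [
        {'id': name, 'label': name, 'type': description[name]}
        for name in columns_order]
    rows = [
        {'c': [{'v': row.get(name)} for name in columns_order]}
        for row in data]
    return json.dumps({'cols': cols, 'rows': rows})

description = {'sample_igf_id': 'string', 'read_count': 'number'}
data = [{'sample_igf_id': 'IGF001', 'read_count': 1200000}]
json_data = to_gviz_json(
    description, data, ('sample_igf_id', 'read_count'))
```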
Generate data for QC project page¶
-
igf_data.utils.project_data_display_utils.
add_seqrun_path_info
(input_data, output_file, seqrun_col='seqrun_igf_id', flowcell_col='flowcell_id', path_col='path')¶ A utility method for adding a remote path to a dataframe for each sequencing run of a project
- Parameters
input_data – An input dataframe containing the seqrun_igf_id and flowcell_id columns
output_file – An output filepath for the json data
seqrun_col – Column name for sequencing run id, default seqrun_igf_id
flowcell_col – Column name for flowcell id, default flowcell_id
path_col – Column name for path, default path
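The transformation can be sketched with a list of dicts standing in for the dataframe. The `<seqrun_igf_id>/<flowcell_id>` path layout below is an assumption for illustration; the real remote layout may differ.

```python
def add_seqrun_path_info(input_data, seqrun_col='seqrun_igf_id',
                         flowcell_col='flowcell_id', path_col='path'):
    """Add a remote path entry, derived from the run id and flowcell id,
    to each sequencing run record (list-of-dicts stand-in for a dataframe)."""
    for row in input_data:
        # Hypothetical path layout: <seqrun_igf_id>/<flowcell_id>
        row[path_col] = '{0}/{1}'.format(row[seqrun_col], row[flowcell_col])
    return input_data

runs = [{'seqrun_igf_id': '180101_M001_0001', 'flowcell_id': 'HXXXX'}]
runs = add_seqrun_path_info(runs)
```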
-
igf_data.utils.project_data_display_utils.
convert_project_data_gviz_data
(input_data, sample_col='sample_igf_id', read_count_col='attribute_value', seqrun_col='flowcell_id')¶ A utility method for converting project’s data availability information to gviz data table format https://developers.google.com/chart/interactive/docs/reference#DataTable
- Parameters
input_data – A pandas dataframe containing the sample_igf_id, flowcell_id and attribute_value (R1_READ_COUNT) columns
sample_col – Column name for sample id, default sample_igf_id
seqrun_col – Column name for sequencing run identifier, default flowcell_id
read_count_col – Column name for sample read counts, default attribute_value
- Returns
A dictionary of the table description, a list of data dictionaries and a tuple of the column order
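The conversion is essentially a pivot: one row per sample, one numeric column per flowcell. A pure-Python sketch of that shape, using dicts in place of the pandas dataframe:

```python
from collections import defaultdict

def project_data_to_table(records, sample_col='sample_igf_id',
                          seqrun_col='flowcell_id',
                          read_count_col='attribute_value'):
    """Pivot per-run read counts into one row per sample with one
    column per flowcell, mirroring the gviz table layout."""
    counts = defaultdict(dict)
    flowcells = []
    for rec in records:
        flowcell = rec[seqrun_col]
        if flowcell not in flowcells:
            flowcells.append(flowcell)  # preserve first-seen column order
        counts[rec[sample_col]][flowcell] = int(rec[read_count_col])
    column_order = tuple([sample_col] + flowcells)
    data = [
        dict({sample_col: sample}, **per_run)
        for sample, per_run in counts.items()]
    description = {sample_col: 'string'}
    description.update({fc: 'number' for fc in flowcells})
    return description, data, column_order

records = [
    {'sample_igf_id': 'IGF001', 'flowcell_id': 'FC1', 'attribute_value': '100'},
    {'sample_igf_id': 'IGF001', 'flowcell_id': 'FC2', 'attribute_value': '200'}]
description, data, column_order = project_data_to_table(records)
```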
Generate data for QC status page¶
-
class
igf_data.utils.project_status_utils.
Project_status
(igf_session_class, project_igf_id, seqrun_work_day=2, analysis_work_day=1, sequencing_resource_name='Sequencing', demultiplexing_resource_name='Demultiplexing', analysis_resource_name='Primary Analysis', task_id_label='task_id', task_name_label='task_name', resource_label='resource', dependencies_label='dependencies', start_date_label='start_date', end_date_label='end_date', duration_label='duration', percent_complete_label='percent_complete')¶ A class for fetching project status and generating a gviz json file for a Google Charts Gantt plot
- Parameters
igf_session_class – Database session class
project_igf_id – Project igf id for database lookup
seqrun_work_day – Duration for seqrun jobs in days, default 2
analysis_work_day – Duration for analysis jobs in days, default 1
sequencing_resource_name – Resource name for sequencing data, default Sequencing
demultiplexing_resource_name – Resource name for demultiplexing data, default Demultiplexing
analysis_resource_name – Resource name for analysis data, default Primary Analysis
task_id_label – Label for task id field, default task_id
task_name_label – Label for task name field, default task_name
resource_label – Label for resource field, default resource
start_date_label – Label for start date field, default start_date
end_date_label – Label for end date field, default end_date
duration_label – Label for duration field, default duration
percent_complete_label – Label for percent complete field, default percent_complete
dependencies_label – Label for dependencies field, default dependencies
-
generate_gviz_json_file
(output_file, demultiplexing_pipeline, analysis_pipeline, active_seqrun_igf_id=None)¶ A wrapper method for writing a gviz json file with project status information
- Parameters
output_file – A filepath for writing project status
demultiplexing_pipeline – Name of the demultiplexing pipeline
analysis_pipeline – Name of the analysis pipeline
active_seqrun_igf_id – Igf id of the active seqrun, default None
- Returns
None
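The field labels above match the column set of a Google Charts Gantt DataTable (task id, task name, resource, start date, end date, duration in milliseconds, percent complete, dependencies). A sketch of building one status row, with the two-day `seqrun_work_day` default from the class; the helper name is illustrative, not part of the API.

```python
import datetime

# Column order expected by a Google Charts Gantt DataTable; the names
# mirror the default labels of the Project_status class above.
STATUS_COLUMN_ORDER = (
    'task_id', 'task_name', 'resource', 'start_date', 'end_date',
    'duration', 'percent_complete', 'dependencies')

def seqrun_status_row(seqrun_igf_id, start_date, work_days=2,
                      resource='Sequencing', percent_complete=100):
    """Build one Gantt row for a sequencing run; duration is given in
    milliseconds, as Google Charts expects."""
    end_date = start_date + datetime.timedelta(days=work_days)
    return {
        'task_id': seqrun_igf_id,
        'task_name': seqrun_igf_id,
        'resource': resource,
        'start_date': start_date,
        'end_date': end_date,
        'duration': work_days * 24 * 60 * 60 * 1000,
        'percent_complete': percent_complete,
        'dependencies': None}

row = seqrun_status_row('180101_M001_0001', datetime.datetime(2018, 1, 1))
```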
-
get_analysis_info
(analysis_pipeline)¶ A method for fetching all active experiments and their run status for a project
- Parameters
analysis_pipeline – Name of the analysis pipeline
- Returns
A list of dictionaries containing the analysis information
-
get_seqrun_info
(active_seqrun_igf_id=None, demultiplexing_pipeline=None)¶ A method for fetching all active sequencing runs for a project
- Parameters
active_seqrun_igf_id – Seqrun igf id for the current run, default None
demultiplexing_pipeline – Name of the demultiplexing pipeline, default None
- Returns
A dictionary containing seqrun information
-
static
get_status_column_order
()¶ A method for fetching column order for status json data
- Returns
A list containing the column order
-
static
get_status_description
()¶ A method for getting description for status json data
- Returns
A dictionary containing status info
Generate data for QC analysis page¶
-
class
igf_data.utils.project_analysis_utils.
Project_analysis
(igf_session_class, collection_type_list, remote_analysis_dir='analysis', use_ephemeral_space=0, attribute_collection_file_type='ANALYSIS_CRAM', pipeline_name='PrimaryAnalysisCombined', pipeline_seed_table='experiment', pipeline_finished_status='FINISHED', sample_id_label='SAMPLE_ID')¶ A class for fetching all the analysis files linked to a project
- Parameters
igf_session_class – A database session class
collection_type_list – A list of collection type for database lookup
remote_analysis_dir – A remote path prefix for analysis file look up, default analysis
attribute_collection_file_type – A filetype for fetching collection attribute records, default ANALYSIS_CRAM
use_ephemeral_space – A toggle for temp dir settings, default 0
pipeline_name – Name of the analysis pipeline, default PrimaryAnalysisCombined
pipeline_seed_table – Name of the pipeline seed table, default experiment
pipeline_finished_status – Pipeline status marking finished analyses, default FINISHED
sample_id_label – Label for the sample id column, default SAMPLE_ID
-
get_analysis_data_for_project
(project_igf_id, output_file, chart_json_output_file=None, csv_output_file=None, gviz_out=True, file_path_column='file_path', type_column='type', sample_igf_id_column='sample_igf_id')¶ A method for fetching all the analysis files for a project
- Parameters
project_igf_id – A project igf id for database lookup
output_file – An output filepath, either a csv or a gviz json
chart_json_output_file – An output filepath for the chart json data, default None
csv_output_file – An output filepath for the csv data, default None
gviz_out – A toggle for converting the output to gviz json, default True
sample_igf_id_column – A column name for sample igf id, default sample_igf_id
file_path_column – A column name for file path, default file_path
type_column – A column name for collection type, default type
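The core of this lookup is grouping file paths per sample, keyed by collection type. A minimal sketch of that grouping with the default column names; the function name is illustrative only, and serialisation to csv or gviz json would follow from the returned structure.

```python
from collections import defaultdict

def analysis_files_per_sample(records, sample_igf_id_column='sample_igf_id',
                              type_column='type',
                              file_path_column='file_path'):
    """Collect analysis file paths per sample, keyed by collection type,
    ready for csv or gviz json serialisation."""
    table = defaultdict(dict)
    for rec in records:
        sample = rec[sample_igf_id_column]
        file_type = rec[type_column]
        table[sample].setdefault(file_type, []).append(rec[file_path_column])
    return dict(table)

records = [
    {'sample_igf_id': 'IGF001', 'type': 'ANALYSIS_CRAM',
     'file_path': 'analysis/IGF001/IGF001.cram'},
    {'sample_igf_id': 'IGF001', 'type': 'ANALYSIS_CRAM_INDEX',
     'file_path': 'analysis/IGF001/IGF001.cram.crai'}]
table = analysis_files_per_sample(records)
```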