IGF database schema and API

Database schema

class igf_data.igfdb.igfTables.Analysis(**kwargs)

A table for loading analysis design information

Parameters
  • analysis_id – An integer id for analysis table

  • project_id – A required integer id from project table (foreign key)

  • analysis_type

    An optional enum list to specify analysis type, default is UNKNOWN, allowed values are

    • RNA_DIFFERENTIAL_EXPRESSION

    • RNA_TIME_SERIES

    • CHIP_PEAK_CALL

    • SOMATIC_VARIANT_CALLING

    • UNKNOWN

  • analysis_description – An optional json description for analysis
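The analysis_description column holds free-form JSON. A minimal sketch of building such a payload (the field names below are illustrative, not part of the schema):

```python
import json

# Hypothetical analysis_description payload; keys are illustrative only
analysis_description = json.dumps({
    'analysis_type': 'RNA_DIFFERENTIAL_EXPRESSION',
    'reference_genome': 'GRCh38',
    'contrasts': [['treated', 'control']]})

# The payload round-trips cleanly for storage in the json column
restored = json.loads(analysis_description)
```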

class igf_data.igfdb.igfTables.Collection(**kwargs)

A table for loading collection information

Parameters
  • collection_id – An integer id for collection table

  • name – A required string to specify collection name, allowed length 70

  • type – A required string to specify collection type, allowed length 50

  • table – An optional enum list to specify collection table information, default unknown, allowed values are sample, experiment, run, file, project, seqrun and unknown

  • date_stamp – An optional timestamp column to record entry creation or modification time, default current timestamp

class igf_data.igfdb.igfTables.Collection_attribute(**kwargs)

A table for loading collection attributes

Parameters
  • collection_attribute_id – An integer id for collection_attribute table

  • attribute_name – An optional string attribute name, allowed length 200

  • attribute_value – An optional string attribute value, allowed length 200

  • collection_id – An integer id from collection table (foreign key)

class igf_data.igfdb.igfTables.Collection_group(**kwargs)

A table for linking files to the collection entries

Parameters
  • collection_group_id – An integer id for collection_group table

  • collection_id – A required integer id from collection table (foreign key)

  • file_id – A required integer id from file table (foreign key)
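Collection_group rows implement a many-to-many link between collections and files. A library-free sketch of resolving that link (ids are illustrative):

```python
# Each row mirrors a Collection_group entry: one collection linked to one file
collection_group = [
    {'collection_group_id': 1, 'collection_id': 1, 'file_id': 10},
    {'collection_group_id': 2, 'collection_id': 1, 'file_id': 11},
    {'collection_group_id': 3, 'collection_id': 2, 'file_id': 10},
]

def files_in_collection(rows, collection_id):
    # Resolve all file ids attached to a collection via the link table
    return sorted(r['file_id'] for r in rows if r['collection_id'] == collection_id)
```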

class igf_data.igfdb.igfTables.Experiment(**kwargs)

A table for loading experiment (unique combination of sample, library and platform) information.

Parameters
  • experiment_id – An integer id for experiment table

  • experiment_igf_id – A required string as experiment id specific to IGF team, allowed length 40

  • project_id – A required integer id from project table (foreign key)

  • sample_id – A required integer id from sample table (foreign key)

  • library_name – A required string to specify library name, allowed length 50

  • library_source

    An optional enum list to specify library source information, default is UNKNOWN, allowed values are

    • GENOMIC

    • TRANSCRIPTOMIC

    • GENOMIC_SINGLE_CELL

    • TRANSCRIPTOMIC_SINGLE_CELL

    • METAGENOMIC

    • METATRANSCRIPTOMIC

    • SYNTHETIC

    • VIRAL_RNA

    • UNKNOWN

  • library_strategy

    An optional enum list to specify library strategy information, default is UNKNOWN, allowed values are

    • WGS

    • WXS

    • WGA

    • RNA-SEQ

    • CHIP-SEQ

    • ATAC-SEQ

    • MIRNA-SEQ

    • NCRNA-SEQ

    • FL-CDNA

    • EST

    • HI-C

    • DNASE-SEQ

    • WCS

    • RAD-SEQ

    • CLONE

    • POOLCLONE

    • AMPLICON

    • CLONEEND

    • FINISHING

    • MNASE-SEQ

    • DNASE-HYPERSENSITIVITY

    • BISULFITE-SEQ

    • CTS

    • MRE-SEQ

    • MEDIP-SEQ

    • MBD-SEQ

    • TN-SEQ

    • VALIDATION

    • FAIRE-SEQ

    • SELEX

    • RIP-SEQ

    • CHIA-PET

    • SYNTHETIC-LONG-READ

    • TARGETED-CAPTURE

    • TETHERED

    • NOME-SEQ

    • CHIRP-SEQ

    • 4-C-SEQ

    • 5-C-SEQ

    • UNKNOWN

  • experiment_type

    An optional enum list as experiment type information, default is UNKNOWN, allowed values are

    • POLYA-RNA

    • POLYA-RNA-3P

    • TOTAL-RNA

    • SMALL-RNA

    • WGS

    • WGA

    • WXS

    • WXS-UTR

    • RIBOSOME-PROFILING

    • RIBODEPLETION

    • 16S

    • NCRNA-SEQ

    • FL-CDNA

    • EST

    • HI-C

    • DNASE-SEQ

    • WCS

    • RAD-SEQ

    • CLONE

    • POOLCLONE

    • AMPLICON

    • CLONEEND

    • FINISHING

    • DNASE-HYPERSENSITIVITY

    • RRBS-SEQ

    • WGBS

    • CTS

    • MRE-SEQ

    • MEDIP-SEQ

    • MBD-SEQ

    • TN-SEQ

    • VALIDATION

    • FAIRE-SEQ

    • SELEX

    • RIP-SEQ

    • CHIA-PET

    • SYNTHETIC-LONG-READ

    • TARGETED-CAPTURE

    • TETHERED

    • NOME-SEQ

    • CHIRP-SEQ

    • 4-C-SEQ

    • 5-C-SEQ

    • METAGENOMIC

    • METATRANSCRIPTOMIC

    • TF

    • H3K27ME3

    • H3K27AC

    • H3K9ME3

    • H3K36ME3

    • H3F3A

    • H3K4ME1

    • H3K79ME2

    • H3K79ME3

    • H3K9ME1

    • H3K9ME2

    • H4K20ME1

    • H2AFZ

    • H3AC

    • H3K4ME2

    • H3K4ME3

    • H3K9AC

    • HISTONE-NARROW

    • HISTONE-BROAD

    • CHIP-INPUT

    • ATAC-SEQ

    • TENX-TRANSCRIPTOME-3P

    • TENX-TRANSCRIPTOME-5P

    • DROP-SEQ-TRANSCRIPTOME

    • UNKNOWN

  • library_layout

    An optional enum list to specify library layout, default is UNKNOWN, allowed values are

    • SINGLE

    • PAIRED

    • UNKNOWN

  • status

    An optional enum list to specify experiment status, default is ACTIVE, allowed values are

    • ACTIVE

    • FAILED

    • WITHDRAWN

  • date_created – An optional timestamp column to record entry creation or modification time, default current timestamp

  • platform_name

    An optional enum list to specify platform model, default is UNKNOWN, allowed values are

    • HISEQ2500

    • HISEQ4000

    • MISEQ

    • NEXTSEQ

    • NANOPORE_MINION

    • DNBSEQ-G400

    • DNBSEQ-G50

    • DNBSEQ-T7

    • UNKNOWN

class igf_data.igfdb.igfTables.Experiment_attribute(**kwargs)

A table for loading experiment attributes

Parameters
  • experiment_attribute_id – An integer id for experiment_attribute table

  • attribute_name – An optional string attribute name, allowed length 30

  • attribute_value – An optional string attribute value, allowed length 50

  • experiment_id – An integer id from experiment table (foreign key)

class igf_data.igfdb.igfTables.File(**kwargs)

A table for loading file information

Parameters
  • file_id – An integer id for file table

  • file_path – A required string to specify file path information, allowed length 500

  • location

    An optional enum list to specify storage location, default UNKNOWN, allowed values are

    • ORWELL

    • HPC_PROJECT

    • ELIOT

    • IRODS

    • UNKNOWN

  • status

    An optional enum list to specify file status, default is ACTIVE, allowed values are

    • ACTIVE

    • FAILED

    • WITHDRAWN

  • md5 – An optional string to specify file md5 value, allowed length 33

  • size – An optional string to specify file size, allowed length 15

  • date_created – An optional timestamp column to record file creation time, default current timestamp

  • date_updated – An optional timestamp column to record file modification time, default current timestamp

class igf_data.igfdb.igfTables.File_attribute(**kwargs)

A table for loading file attributes

Parameters
  • file_attribute_id – An integer id for file_attribute table

  • attribute_name – An optional string attribute name, allowed length 30

  • attribute_value – An optional string attribute value, allowed length 50

  • file_id – An integer id from file table (foreign key)

class igf_data.igfdb.igfTables.Flowcell_barcode_rule(**kwargs)

A table for loading flowcell specific barcode rules information

Parameters
  • flowcell_rule_id – An integer id for flowcell_barcode_rule table

  • platform_id – An integer id for platform table (foreign key)

  • flowcell_type – A required string as flowcell type name, allowed length 50

  • index_1

    An optional enum list as index_1 specific rule, default UNKNOWN, allowed values are

    • NO_CHANGE

    • REVCOMP

    • UNKNOWN

  • index_2

    An optional enum list as index_2 specific rule, default UNKNOWN, allowed values are

    • NO_CHANGE

    • REVCOMP

    • UNKNOWN
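The index rules above describe how a barcode must be transformed before demultiplexing on a given flowcell type. A minimal sketch of applying them (not part of the library):

```python
# Translation table for the DNA complement
_COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def apply_barcode_rule(index_seq, rule):
    # REVCOMP reverse-complements the barcode;
    # NO_CHANGE and UNKNOWN leave it untouched
    if rule == 'REVCOMP':
        return index_seq.translate(_COMPLEMENT)[::-1]
    return index_seq
```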

class igf_data.igfdb.igfTables.History(**kwargs)

A table for loading history information

Parameters
  • log_id – An integer id for history table

  • log_type

    A required enum value to specify log type, allowed values are

    • CREATED

    • MODIFIED

    • DELETED

  • table_name

    A required enum value to specify table information, allowed values are

    • PROJECT

    • USER

    • SAMPLE

    • EXPERIMENT

    • RUN

    • COLLECTION

    • FILE

    • PLATFORM

    • PROJECT_ATTRIBUTE

    • EXPERIMENT_ATTRIBUTE

    • COLLECTION_ATTRIBUTE

    • SAMPLE_ATTRIBUTE

    • RUN_ATTRIBUTE

    • FILE_ATTRIBUTE

  • log_date – An optional timestamp column to record file creation or modification time, default current timestamp

  • message – An optional text field to specify message

class igf_data.igfdb.igfTables.Pipeline(**kwargs)

A table for loading pipeline information

Parameters
  • pipeline_id – An integer id for pipeline table

  • pipeline_name – A required string to specify pipeline name, allowed length 50

  • pipeline_db – A required string to specify pipeline database url, allowed length 200

  • pipeline_init_conf – An optional json field to specify initial pipeline configuration

  • pipeline_run_conf – An optional json field to specify modified pipeline configuration

  • pipeline_type

    An optional enum list to specify pipeline type, default EHIVE, allowed values are

    • EHIVE

    • UNKNOWN

  • is_active – An optional enum list to specify the status of pipeline, default Y, allowed values are Y and N

  • date_stamp – An optional timestamp column to record file creation or modification time, default current timestamp

class igf_data.igfdb.igfTables.Pipeline_seed(**kwargs)

A table for loading pipeline seed information

Parameters
  • pipeline_seed_id – An integer id for pipeline_seed table

  • seed_id – A required integer id

  • seed_table – An optional enum list to specify seed table information, default unknown, allowed values project, sample, experiment, run, file, seqrun, collection and unknown

  • pipeline_id – An integer id from pipeline table (foreign key)

  • status

    An optional enum list to specify the status of pipeline, default UNKNOWN, allowed values are

    • SEEDED

    • RUNNING

    • FINISHED

    • FAILED

    • UNKNOWN

  • date_stamp – An optional timestamp column to record file creation or modification time, default current timestamp

class igf_data.igfdb.igfTables.Platform(**kwargs)

A table for loading sequencing platform information

Parameters
  • platform_id – An integer id for platform table

  • platform_igf_id – A required string as platform id specific to IGF team, allowed length 10

  • model_name

    A required enum list to specify platform model, allowed values are

    • HISEQ2500

    • HISEQ4000

    • MISEQ

    • NEXTSEQ

    • NOVASEQ6000

    • NANOPORE_MINION

    • DNBSEQ-G400

    • DNBSEQ-G50

    • DNBSEQ-T7

  • vendor_name

    A required enum list to specify vendor’s name, allowed values are

    • ILLUMINA

    • NANOPORE

    • MGI

  • software_name

    A required enum list for specifying platform software, allowed values are

    • RTA

    • UNKNOWN

  • software_version – An optional software version number, default is UNKNOWN

  • date_created – An optional timestamp column to record entry creation time, default current timestamp

class igf_data.igfdb.igfTables.Project(**kwargs)

A table for loading project information

Parameters
  • project_id – An integer id for project table

  • project_igf_id – A required string as project id specific to IGF team, allowed length 50

  • project_name – An optional string as project name

  • start_timestamp – An optional timestamp for project creation, default current timestamp

  • description – An optional text column to document project description

  • deliverable

    An enum list to document project deliverable, default FASTQ, allowed entries are

    • FASTQ

    • ALIGNMENT

    • ANALYSIS

  • status

    An enum list for project status, default ACTIVE allowed entries are

    • ACTIVE

    • FINISHED

    • WITHDRAWN

class igf_data.igfdb.igfTables.ProjectUser(**kwargs)

A table for linking users to the projects

Parameters
  • project_user_id – An integer id for project_user table

  • project_id – An integer id for project table (foreign key)

  • user_id – An integer id for user table (foreign key)

  • data_authority – An optional enum value to denote primary user for the project, allowed value T

class igf_data.igfdb.igfTables.Project_attribute(**kwargs)

A table for loading project attributes

Parameters
  • project_attribute_id – An integer id for project_attribute table

  • attribute_name – An optional string attribute name, allowed length 50

  • attribute_value – An optional string attribute value, allowed length 50

  • project_id – An integer id from project table (foreign key)

class igf_data.igfdb.igfTables.Run(**kwargs)

A table for loading run (unique combination of experiment, sequencing flowcell and lane) information

Parameters
  • run_id – An integer id for run table

  • run_igf_id – A required string as run id specific to IGF team, allowed length 70

  • experiment_id – A required integer id from experiment table (foreign key)

  • seqrun_id – A required integer id from seqrun table (foreign key)

  • status

    An optional enum list to specify run status, default is ACTIVE, allowed values are

    • ACTIVE

    • FAILED

    • WITHDRAWN

  • lane_number – A required enum list for specifying lane information, allowed values 1, 2, 3, 4, 5, 6, 7 and 8

  • date_created – An optional timestamp column to record entry creation time, default current timestamp

class igf_data.igfdb.igfTables.Run_attribute(**kwargs)

A table for loading run attributes

Parameters
  • run_attribute_id – An integer id for run_attribute table

  • attribute_name – An optional string attribute name, allowed length 30

  • attribute_value – An optional string attribute value, allowed length 50

  • run_id – An integer id from run table (foreign key)

class igf_data.igfdb.igfTables.Sample(**kwargs)

A table for loading sample information

Parameters
  • sample_id – An integer id for sample table

  • sample_igf_id – A required string as sample id specific to IGF team, allowed length 20

  • sample_submitter_id – An optional string as sample name from user, allowed length 40

  • taxon_id – An optional integer NCBI taxonomy information for sample

  • scientific_name – An optional string as scientific name of the species

  • species_name – An optional string as the species name (genome build code) information

  • donor_anonymized_id – An optional string as anonymous donor name

  • description – An optional string as sample description

  • phenotype – An optional string as sample phenotype information

  • sex

    An optional enum list to specify sample sex, default UNKNOWN allowed values are

    • FEMALE

    • MALE

    • MIXED

    • UNKNOWN

  • status

    An optional enum list to specify sample status, default ACTIVE, allowed values are

    • ACTIVE

    • FAILED

    • WITHDRAWN

  • biomaterial_type

    An optional enum list as sample biomaterial type, default UNKNOWN, allowed values are

    • PRIMARY_TISSUE

    • PRIMARY_CELL

    • PRIMARY_CELL_CULTURE

    • CELL_LINE

    • SINGLE_NUCLEI

    • UNKNOWN

  • cell_type – An optional string to specify sample cell_type information, if biomaterial_type is PRIMARY_CELL or PRIMARY_CELL_CULTURE

  • tissue_type – An optional string to specify sample tissue information, if biomaterial_type is PRIMARY_TISSUE

  • cell_line – An optional string to specify cell line information, if biomaterial_type is CELL_LINE

  • date_created – An optional timestamp column to specify entry creation date, default current timestamp

  • project_id – An integer id for project table (foreign key)
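The cell_type, tissue_type and cell_line columns are conditional on biomaterial_type. A small sketch of checking that dependency before loading a record (illustrative, not part of the library):

```python
def has_required_biomaterial_fields(sample):
    # Mirrors the conditional columns described above
    biomaterial_type = sample.get('biomaterial_type', 'UNKNOWN')
    if biomaterial_type in ('PRIMARY_CELL', 'PRIMARY_CELL_CULTURE'):
        return bool(sample.get('cell_type'))
    if biomaterial_type == 'PRIMARY_TISSUE':
        return bool(sample.get('tissue_type'))
    if biomaterial_type == 'CELL_LINE':
        return bool(sample.get('cell_line'))
    return True  # UNKNOWN and SINGLE_NUCLEI need no extra field
```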

class igf_data.igfdb.igfTables.Sample_attribute(**kwargs)

A table for loading sample attributes

Parameters
  • sample_attribute_id – An integer id for sample_attribute table

  • attribute_name – An optional string attribute name, allowed length 50

  • attribute_value – An optional string attribute value, allowed length 50

  • sample_id – An integer id from sample table (foreign key)

class igf_data.igfdb.igfTables.Seqrun(**kwargs)

A table for loading sequencing run information

Parameters
  • seqrun_id – An integer id for seqrun table

  • seqrun_igf_id – A required string as seqrun id specific to IGF team, allowed length 50

  • reject_run – An optional enum list to specify rejected run information, default N, allowed values Y and N

  • date_created – An optional timestamp column to record entry creation time, default current timestamp

  • flowcell_id – A required string column for storing flowcell_id information, allowed length 20

  • platform_id – An integer platform id (foreign key)

class igf_data.igfdb.igfTables.Seqrun_attribute(**kwargs)

A table for loading seqrun attributes

Parameters
  • seqrun_attribute_id – An integer id for seqrun_attribute table

  • attribute_name – An optional string attribute name, allowed length 50

  • attribute_value – An optional string attribute value, allowed length 100

  • seqrun_id – An integer id from seqrun table (foreign key)

class igf_data.igfdb.igfTables.Seqrun_stats(**kwargs)

A table for loading sequencing stats information

Parameters
  • seqrun_stats_id – An integer id for seqrun_stats table

  • seqrun_id – An integer seqrun id (foreign key)

  • lane_number – A required enum list for specifying lane information, allowed values are 1, 2, 3, 4, 5, 6, 7 and 8

  • bases_mask – An optional string field for storing bases mask information

  • undetermined_barcodes – An optional json field to store barcode info for undetermined samples

  • known_barcodes – An optional json field to store barcode info for known samples

  • undetermined_fastqc – An optional json field to store qc info for undetermined samples

class igf_data.igfdb.igfTables.User(**kwargs)

A table for loading user information

Parameters
  • user_id – An integer id for user table

  • user_igf_id – An optional string as user id specific to IGF team, allowed length 10

  • name – A required string as user name, allowed length 30

  • email_id – A required string as email id, allowed length 40

  • username – A required string as IGF username, allowed length 20

  • hpc_username – An optional string as Imperial College’s HPC login name, allowed length 20

  • twitter_user – An optional string as twitter user name, allowed length 20

  • category

    An optional enum list as user category, default NON_HPC_USER, allowed values are

    • HPC_USER

    • NON_HPC_USER

    • EXTERNAL

  • status

    An optional enum list as user status, default is ACTIVE, allowed values are

    • ACTIVE

    • BLOCKED

    • WITHDRAWN

  • date_created – An optional timestamp, default current timestamp

  • password – An optional string field to store encrypted password

  • encryption_salt – An optional string field to store encryption salt

  • ht_password – An optional field to store password for htaccess

Database adaptor API

Base adaptor

class igf_data.igfdb.baseadaptor.BaseAdaptor(**data)

The base adaptor class

divide_data_to_table_and_attribute(data, required_column, table_columns, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for main and attribute tables

Parameters
  • data – a dictionary or dataframe containing the data

  • required_column – column to add to the attribute table; it must be present in the data

  • table_columns – required columns for the main table

  • attribute_name_column – column label for attribute name

  • attribute_value_column – column label for attribute value

Returns

Two pandas dataframes, one for the main table and one for the attribute table
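The split can be pictured with a library-free sketch (the real method works on pandas dataframes; the column names below are illustrative):

```python
def split_table_and_attributes(rows, required_column, table_columns,
                               attribute_name_column='attribute_name',
                               attribute_value_column='attribute_value'):
    table_rows, attribute_rows = [], []
    for row in rows:
        # Columns known to the main table stay in the main record
        table_rows.append({k: v for k, v in row.items() if k in table_columns})
        # Everything else becomes a name/value pair, keyed by required_column
        for key, value in row.items():
            if key not in table_columns:
                attribute_rows.append({
                    required_column: row[required_column],
                    attribute_name_column: key,
                    attribute_value_column: value})
    return table_rows, attribute_rows
```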

fetch_records(query, output_mode='dataframe')

A method for fetching records using a query

Parameters
  • query – A SQLAlchemy query object

  • output_mode – dataframe / object / one / one_or_none

Returns

A pandas dataframe for dataframe mode and a generator object for object mode
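The output modes behave roughly as follows (a sketch with a plain list standing in for the SQLAlchemy result set):

```python
def dispatch_output_mode(results, output_mode='dataframe'):
    # 'results' stands in for rows returned by a SQLAlchemy query
    if output_mode == 'object':
        return (row for row in results)      # generator over ORM objects
    if output_mode == 'one':
        if len(results) != 1:
            raise ValueError('expected exactly one record')
        return results[0]
    if output_mode == 'one_or_none':
        if len(results) > 1:
            raise ValueError('expected at most one record')
        return results[0] if results else None
    return list(results)  # 'dataframe' mode would build a pandas DataFrame
```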

fetch_records_by_column(table, column_name, column_id, output_mode)

A method for fetching records matching a column value

Parameters
  • table – table name

  • column_name – a column name

  • column_id – a column id value

  • output_mode – dataframe / object / one / one_or_none

fetch_records_by_multiple_column(table, column_data, output_mode)

A method for fetching records matching multiple column values

Parameters
  • table – table name

  • column_data – a dictionary of column_name: column_value pairs

  • output_mode – dataframe / object / one / one_or_none

get_attributes_by_dbid(attribute_table, linked_table, linked_column_name, db_id)

A method for fetching attribute records for a specific attribute table with a db_id linked as foreign key

Parameters
  • attribute_table – An attribute table object

  • linked_table – A main table object

  • linked_column_name – A column name linking the attribute table to the main table

  • db_id – A unique id to link main table

Returns

A dataframe of records

get_table_columns(table_name, excluded_columns)

A method for fetching the columns for table table_name

Parameters
  • table_name – a table class name

  • excluded_columns – a list of column names to exclude from output

map_foreign_table_and_store_attribute(data, lookup_table, lookup_column_name, target_column_name)

A method for mapping foreign key id to the new column

Parameters
  • data – a data dictionary or pandas series, to be stored in attribute table

  • lookup_table – a table class to look for the foreign key id

  • lookup_column_name – a string or a list of column names which will be used to link the data frame with lookup_table, this column will be removed from the output series

  • target_column_name – column name for the foreign key id

Returns

A data series
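The mapping can be sketched without the ORM: look up the foreign-key id via the lookup column, drop that column, and store the id under the target column (names below are illustrative):

```python
def map_foreign_key(record, lookup, lookup_column_name, target_column_name):
    # 'lookup' maps lookup-column values to primary-key ids,
    # standing in for a query against the lookup table
    mapped = dict(record)
    mapped[target_column_name] = lookup[mapped.pop(lookup_column_name)]
    return mapped
```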

store_attributes(attribute_table, data, linked_column='', db_id='', mode='serial')

A method for storing attributes

Parameters
  • attribute_table – an attribute table name

  • data – attribute data as a pandas dataframe or a list of dictionaries

  • linked_column – a column name to link the db_id to attribute table

  • db_id – a db_id to link the attribute records

  • mode – serial / bulk

store_records(table, data, mode='serial')

A method for loading data to table

Parameters
  • table – name of the table class

  • data – a pandas dataframe or a list of dictionaries

  • mode – serial / bulk

Project adaptor

class igf_data.igfdb.projectadaptor.ProjectAdaptor(**data)

An adaptor class for Project, ProjectUser and Project_attribute tables

assign_user_to_project(data, required_project_column='project_igf_id', required_user_column='email_id', data_authority_column='data_authority', autosave=True)

Load data to ProjectUser table

Parameters
  • data – A list of dictionaries, each containing 'project_igf_id' and 'email_id' as keys with the relevant igf ids as values. An optional key 'data_authority' with a boolean value can be provided to set the user as the data authority of the project, e.g. [{'project_igf_id': val, 'email_id': val, 'data_authority': True}]

  • required_project_column – Name of the project id column, default project_igf_id

  • required_user_column – Name of the user id column, default email_id

  • data_authority_column – Name of the data_authority column, default data_authority

  • autosave – A toggle for autocommit to db, default True

Returns

None
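The expected payload is a list of dictionaries; a quick library-free check of its shape (values are illustrative):

```python
def valid_project_user_rows(rows, required_project_column='project_igf_id',
                            required_user_column='email_id'):
    # Every row must carry the project and user id keys;
    # 'data_authority' is optional
    required = {required_project_column, required_user_column}
    return all(required <= set(row) for row in rows)
```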

check_data_authority_for_project(project_igf_id)

A method for checking user data authority for existing projects

Parameters

project_igf_id – A unique project igf id

Returns

True if a data authority exists for the project, otherwise False

check_existing_project_user(project_igf_id, email_id)

A method for checking existing project user info in database

Parameters
  • project_igf_id – A project_igf_id

  • email_id – An email_id

Returns

True if the entry is present in the db, otherwise False

check_project_attributes(project_igf_id, attribute_name)

A method for checking existing project attribute in database

Parameters
  • project_igf_id – A unique project igf id

  • attribute_name – An attribute name

Returns

A boolean value

check_project_records_igf_id(project_igf_id, target_column_name='project_igf_id')

A method for checking existing data for Project table

Parameters
  • project_igf_id – Project igf id name

  • target_column_name – Name of the project id column, default project_igf_id

Returns

True if the record is present in the db, otherwise False

count_project_samples(project_igf_id, only_active=True)

A method for counting total number of samples for a project

Parameters
  • project_igf_id – A project id

  • only_active – Toggle for including only active projects, default is True

Returns

An integer sample count

divide_data_to_table_and_attribute(data, required_column='project_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Project and Project_attribute tables

Parameters
  • data – A list of dictionaries or a pandas dataframe

  • table_columns – List of table column names, default None

  • required_column – Name of the required column, default project_igf_id

  • attribute_name_column – Value for attribute name column, default attribute_name

  • attribute_value_column – Value for attribute value column, default attribute_value

Returns

A project dataframe and a project attribute dataframe

fetch_all_project_igf_ids(output_mode='dataframe')

A method for fetching a list of all project igf ids

Parameters

output_mode – Output mode, default dataframe

fetch_data_authority_for_project(project_igf_id)

A method for fetching user data authority for existing projects

Parameters

project_igf_id – A unique project igf id

Returns

A user object or None, if no entry found

fetch_project_records_igf_id(project_igf_id, target_column_name='project_igf_id')

A method for fetching data for Project table

Parameters
  • project_igf_id – an igf id

  • target_column_name – Name of the project id column, default project_igf_id

Returns

Records from project table

fetch_project_samples(project_igf_id, only_active=True, output_mode='object')

A method for fetching all the samples for a specific project

Parameters
  • project_igf_id – A project id

  • only_active – Toggle for including only active projects, default is True

  • output_mode – Output mode, default object

Returns

A generator expression, dataframe or an object, depending on the output_mode

get_project_attributes(project_igf_id, linked_column_name='project_id', attribute_name='')

A method for fetching entries from project attribute table

Parameters
  • project_igf_id – A project_igf_id string

  • attribute_name – An attribute name, default ''

  • linked_column_name – A column name for linking attribute table

Returns

A dataframe of records

get_project_user_info(output_mode='dataframe', project_igf_id='')

A method for fetching information from Project, User and ProjectUser table

Parameters
  • project_igf_id – a project igf id

  • output_mode – dataframe / object

Returns

Records for project user

store_project_and_attribute_data(data, autosave=True)

A method for dividing and storing data to project and attribute_table

Parameters
  • data – A list of data or a pandas dataframe

  • autosave – A toggle for autocommit, default True

Returns

None

store_project_attributes(data, project_id='', autosave=False)

A method for storing data to Project_attribute table

Parameters
  • data – A pandas dataframe

  • project_id – Project id for attribute table, default ‘’

  • autosave – A toggle for autocommit, default False

Returns

None

store_project_data(data, autosave=False)

Load data to Project table

Parameters
  • data – A list of data or a pandas dataframe

  • autosave – A toggle for autocommit, default False

Returns

None

User adaptor

class igf_data.igfdb.useradaptor.UserAdaptor(**data)

An adaptor class for table User

check_user_records_email_id(email_id)

A method for checking existing user data in db

Parameters

email_id – An email id

Returns

True if the user entry is present in the db, otherwise False

fetch_user_records_email_id(user_email_id)

A method for fetching data for User table

Parameters

user_email_id – an email id

Returns

user object

fetch_user_records_igf_id(user_igf_id)

A method for fetching data for User table

Parameters

user_igf_id – an igf id

Returns

user object

store_user_data(data, autosave=True)

Load data to user table

Parameters
  • data – A pandas dataframe

  • autosave – A toggle for autocommit, default True

Returns

None

Sample adaptor

class igf_data.igfdb.sampleadaptor.SampleAdaptor(**data)

An adaptor class for Sample and Sample_attribute tables

check_project_and_sample(project_igf_id, sample_igf_id)

A method for checking existing project and sample igf id combination in sample table

Parameters
  • project_igf_id – A project igf id string

  • sample_igf_id – A sample igf id string

Returns

True if the target entry is present, otherwise False

check_sample_records_igf_id(sample_igf_id, target_column_name='sample_igf_id')

A method for checking existing data for sample table

Parameters
  • sample_igf_id – an igf id

  • target_column_name – name of the target lookup column, default sample_igf_id

Returns

True if the record is present in the db, otherwise False

divide_data_to_table_and_attribute(data, required_column='sample_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Sample and Sample_attribute tables

Parameters
  • data – A list of dictionaries or a pandas dataframe

  • table_columns – List of table column names, default None

  • required_column – column name to add to the attribute data

  • attribute_name_column – label for attribute name column

  • attribute_value_column – label for attribute value column

Returns

Two pandas dataframes, one for Sample and another for Sample_attribute table

fetch_sample_project(sample_igf_id)

A method for fetching project information for the sample

Parameters

sample_igf_id – A sample_igf_id for database lookup

Returns

A project_igf_id or None, if not found

fetch_sample_records_igf_id(sample_igf_id, target_column_name='sample_igf_id')

A method for fetching data for Sample table

Parameters
  • sample_igf_id – A sample igf id

  • output_mode – dataframe, object, one or one_or_none

Returns

An object or dataframe, based on the output_mode

store_sample_and_attribute_data(data, autosave=True)

A method for dividing and storing data to sample and attribute table

store_sample_attributes(data, sample_id='', autosave=False)

A method for storing data to Sample_attribute table

Parameters
  • data – A dataframe or list of dictionary containing the Sample_attribute data

  • sample_id – An optional parameter to link the sample attributes to a specific sample

store_sample_data(data, autosave=False)

Load data to Sample table

Parameters

data – A dataframe or list of dictionary containing the data

Experiment adaptor

class igf_data.igfdb.experimentadaptor.ExperimentAdaptor(**data)

An adaptor class for Experiment and Experiment_attribute tables

check_experiment_records_id(experiment_igf_id, target_column_name='experiment_igf_id')

A method for checking existing data for Experiment table

Parameters
  • experiment_igf_id – an igf id

  • target_column_name – a column name, default experiment_igf_id

Returns

True if the record is present in the db, otherwise False

divide_data_to_table_and_attribute(data, required_column='experiment_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Experiment and Experiment_attribute tables

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • table_columns – List of table column names, default None

  • required_column – column name to add to the attribute data

  • attribute_name_column – label for attribute name column

  • attribute_value_column – label for attribute value column

Returns

Two pandas dataframes, one for Experiment and another for Experiment_attribute table
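The table/attribute split described above can be sketched with pandas; the helper below is an illustrative reimplementation of the idea, not the adaptor's actual code, and the record ids are hypothetical.

```python
import pandas as pd

def divide_to_table_and_attribute(
        data, required_column, table_columns,
        attribute_name_column='attribute_name',
        attribute_value_column='attribute_value'):
    """Split records into known table columns and leftover attributes."""
    df = pd.DataFrame(data)
    table_df = df[[c for c in df.columns if c in table_columns]]
    # melt the leftover columns into name/value pairs, keeping the
    # required column so each attribute stays linked to its record
    attr_cols = [required_column] + \
                [c for c in df.columns if c not in table_columns]
    attribute_df = df[attr_cols].melt(
        id_vars=[required_column],
        var_name=attribute_name_column,
        value_name=attribute_value_column)
    return table_df, attribute_df

# hypothetical experiment record with one extra attribute column
records = [{'experiment_igf_id': 'IGF_EXP_1',
            'project_igf_id': 'IGF_PRJ_1',
            'library_layout': 'PAIRED'}]
table_df, attr_df = divide_to_table_and_attribute(
    records, 'experiment_igf_id',
    table_columns=['experiment_igf_id', 'project_igf_id'])
```

Columns named in table_columns land in the first dataframe; everything else becomes attribute name/value rows keyed by the required column.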

fetch_experiment_records_id(experiment_igf_id, target_column_name='experiment_igf_id')

A method for fetching data for Experiment table

Parameters
  • experiment_igf_id – an igf id

  • target_column_name – a column name, default experiment_igf_id

Returns

Experiment object

fetch_project_and_sample_for_experiment(experiment_igf_id)

A method for fetching project and sample igf id information for an experiment

Parameters

experiment_igf_id – An experiment igf id string

Returns

Two strings, project igf id and sample igf id, or None if not found

fetch_runs_for_igf_id(experiment_igf_id, include_active_runs=True, output_mode='dataframe')

A method for fetching all the runs for a specific experiment_igf_id

Parameters
  • experiment_igf_id – An experiment_igf_id

  • include_active_runs – Include only active runs if True, default True

  • output_mode – Record fetch mode, default dataframe

fetch_sample_attribute_records_for_experiment_igf_id(experiment_igf_id, output_mode='dataframe', attribute_list=None)

A method for fetching sample_attribute_records for a given experiment_igf_id

Parameters
  • experiment_igf_id – An experiment_igf_id

  • output_mode – Result output mode, default dataframe

  • attribute_list – A list of attributes for database lookup, default None

Returns

An object or dataframe based on the output_mode

store_experiment_attributes(data, experiment_id='', autosave=False)

A method for storing data to Experiment_attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame for experiment attribute data

  • experiment_id – An optional experiment_id to link attribute records

  • autosave – A toggle for automatically saving data to db, default False

store_experiment_data(data, autosave=False)

Load data to Experiment table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for automatically saving data to db, default False

store_project_and_attribute_data(data, autosave=True)

A method for dividing and storing data to experiment and attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for automatically saving data to db, default True

update_experiment_records_by_igf_id(experiment_igf_id, update_data, autosave=True)

A method for updating experiment records in database

Parameters
  • experiment_igf_id – An igf id for the experiment data lookup

  • update_data – A dictionary containing the updated entries

  • autosave – Toggle auto commit after database update, default True

Run adaptor

class igf_data.igfdb.runadaptor.RunAdaptor(**data)

An adaptor class for Run and Run_attribute tables

check_run_records_igf_id(run_igf_id, target_column_name='run_igf_id')

A method for checking existing data for Run table

Parameters
  • run_igf_id – an igf id

  • target_column_name – a column name, default run_igf_id

Returns

True if the record is present in the db, False if it is not

divide_data_to_table_and_attribute(data, required_column='run_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Run and Run_attribute tables

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • table_columns – List of table column names, default None

  • required_column – column name to add to the attribute data

  • attribute_name_column – label for attribute name column

  • attribute_value_column – label for attribute value column

Returns

Two pandas dataframes, one for Run and another for Run_attribute table

fetch_flowcell_and_lane_for_run(run_igf_id)

A run adaptor method for fetching flowcell id and lane info for each run

Parameters

run_igf_id – A run igf id string

Returns

Flowcell id and lane number, or None if no records are found

fetch_project_sample_and_experiment_for_run(run_igf_id)

A method for fetching project, sample and experiment information for a run

Parameters

run_igf_id – A run igf id string

Returns

A list of three strings, or None if not found

  • project_igf_id

  • sample_igf_id

  • experiment_igf_id

fetch_run_records_igf_id(run_igf_id, target_column_name='run_igf_id')

A method for fetching data for Run table

Parameters
  • run_igf_id – an igf id

  • target_column_name – a column name, default run_igf_id

fetch_sample_info_for_run(run_igf_id)

A method for fetching sample information linked to a run_igf_id

Parameters

run_igf_id – A run_igf_id to search database

store_run_and_attribute_data(data, autosave=True)

A method for dividing and storing data to run and attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame containing the run data

  • autosave – A toggle for saving data automatically to db, default True

store_run_attributes(data, run_id='', autosave=False)

A method for storing data to Run_attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame containing the attribute data

  • autosave – A toggle for saving data automatically to db, default False

store_run_data(data, autosave=False)

A method for loading data to Run table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame containing the run data

  • autosave – A toggle for saving data automatically to db, default False

Collection adaptor

class igf_data.igfdb.collectionadaptor.CollectionAdaptor(**data)

An adaptor class for Collection, Collection_group and Collection_attribute tables

check_collection_attribute(collection_name, collection_type, attribute_name)

A method for checking collection attribute records for an attribute_name

Parameters
  • collection_name – A collection name

  • collection_type – A collection type

  • attribute_name – A collection attribute name

Returns

Boolean, True if record exists or False

check_collection_records_name_and_type(collection_name, collection_type)

A method for checking existing data for Collection table

Parameters
  • collection_name – a collection name value

  • collection_type – a collection type value

Returns

True if the record is present in the db, False if it is not

create_collection_group(data, autosave=True, required_collection_column=('name', 'type'), required_file_column='file_path')

A function for creating collection group, a link between a file and a collection

Parameters
  • data

    A list of dictionaries or a Pandas DataFrame with the following columns
    • name

    • type

    • file_path

    E.g. [{'name': 'a collection name', 'type': 'a collection type', 'file_path': 'path'},]

  • required_collection_column – List of required columns for fetching collection, default ('name', 'type')

  • required_file_column – Required column for fetching file information, default file_path

  • autosave – A toggle for saving changes to database, default True
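A minimal sketch of the input payload for this method; the collection name, type and file path below are hypothetical placeholders.

```python
# hypothetical rows for create_collection_group: each row links one
# file_path to one collection identified by name and type
collection_group_data = [
    {'name': 'IGF_SMP_1_star',          # hypothetical collection name
     'type': 'STAR_BAM',                # hypothetical collection type
     'file_path': '/data/IGF_PRJ_1/IGF_SMP_1/star/IGF_SMP_1.bam'},
]

# every row must carry the required collection and file columns
for row in collection_group_data:
    assert {'name', 'type'} <= set(row)   # required_collection_column
    assert 'file_path' in row             # required_file_column
```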

create_or_update_collection_attributes(data, autosave=True)

A method for creating or updating collection attribute table, if the collection exists

Parameters
  • data

    A list of dictionaries, containing following entries

    • name

    • type

    • attribute_name

    • attribute_value

  • autosave – A toggle for saving changes to database, default True

divide_data_to_table_and_attribute(data, required_column=('name', 'type'), table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Collection and Collection_attribute tables

Parameters
  • data – A list of dictionaries or a pandas dataframe

  • table_columns – List of table column names, default None

  • required_column – column name to add to the attribute data, default ‘name’, ‘type’

  • attribute_name_column – label for attribute name column, default attribute_name

  • attribute_value_column – label for attribute value column, default attribute_value

Returns

Two pandas dataframes, one for Collection and another for Collection_attribute table

fetch_collection_name_and_table_from_file_path(file_path)

A method for fetching collection name and collection_table info using the file_path information. It will return None if the file doesn’t have any collection present in the database

Parameters

file_path – A filepath info

Returns

Collection name and collection table for first collection group

fetch_collection_records_name_and_type(collection_name, collection_type, target_column_name=('name', 'type'))

A method for fetching data for Collection table

Parameters
  • collection_name – a collection name value

  • collection_type – a collection type value

  • target_column_name – a list of columns, default is [‘name’,’type’]

get_collection_files(collection_name, collection_type='', collection_table='', output_mode='dataframe')

A method for fetching information from Collection, File, Collection_group tables

Parameters
  • collection_name – A collection name to fetch the linked files

  • collection_type – A collection type

  • collection_table – A collection table

  • output_mode – dataframe / object

load_file_and_create_collection(data, autosave=True, hasher='md5', calculate_file_size_and_md5=True, required_coumns=('name', 'type', 'table', 'file_path', 'size', 'md5', 'location'))

A function for loading files to db and creating collections

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – Save data to db, default True

  • required_coumns – List of required columns

  • hasher – Method for file checksum, default md5

  • calculate_file_size_and_md5 – Enable file size and md5 check, default True

static prepare_data_for_collection_attribute(collection_name, collection_type, data_list)

A static method for building data structure for collection attribute table update

Parameters
  • collection_name – A collection name

  • collection_type – A collection type

  • data_list – A list of dictionaries containing the data for attribute table

Returns

A new list of dictionaries for the collection attribute table

remove_collection_group_info(data, autosave=True, required_collection_column=('name', 'type'), required_file_column='file_path')

A method for removing collection group information from database

Parameters
  • data

    A list of dictionaries or a Pandas DataFrame with the following columns
    • name

    • type

    • file_path

    File_path information is not mandatory

  • required_collection_column – List of required columns for fetching collection, default ('name', 'type')

  • required_file_column – Required column for fetching file information, default file_path

  • autosave – A toggle for saving changes to database, default True

store_collection_and_attribute_data(data, autosave=True)

A method for dividing and storing data to collection and attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for saving changes to database, default True

store_collection_attributes(data, collection_id='', autosave=False)

A method for storing data to Collection_attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • collection_id – A collection id, optional

  • autosave – A toggle for saving changes to database, default False

store_collection_data(data, autosave=False)

A method for loading data to Collection table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for saving changes to database, default False

update_collection_attribute(collection_name, collection_type, attribute_name, attribute_value, autosave=True)

A method for updating collection attribute

Parameters
  • collection_name – A collection name

  • collection_type – A collection type

  • attribute_name – A collection attribute name

  • attribute_value – A collection attribute value

  • autosave – A toggle for committing changes to db, default True

File adaptor

class igf_data.igfdb.fileadaptor.FileAdaptor(**data)

An adaptor class for File tables

check_file_records_file_path(file_path)

A method for checking file information in database

Parameters

file_path – An absolute filepath

Returns

True if the file is present in the db, False if it is not

divide_data_to_table_and_attribute(data, required_column='file_path', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for File and File_attribute tables

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • table_columns – List of table column names, default None

  • required_column – A column name to add to the attribute data

  • attribute_name_column – A label for attribute name column

  • attribute_value_column – A label for attribute value column

Returns

Two pandas dataframes, one for File and another for File_attribute table

fetch_file_records_file_path(file_path)

A method for fetching data for file table

Parameters

file_path – an absolute file path

Returns

A file object

remove_file_data_for_file_path(file_path, remove_file=False, autosave=True)

A method for removing entry for a specific file.

Parameters
  • file_path – A complete file_path for checking database

  • remove_file – A toggle for removing filepath, default False

  • autosave – A toggle for automatically saving changes to database, default True

store_file_and_attribute_data(data, autosave=True)

A method for dividing and storing data to file and attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for automatically saving changes to db, default True

store_file_attributes(data, file_id='', autosave=False)

A method for storing data to File_attribute table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • file_id – A file_id for updating the attribute table, default empty string

  • autosave – A toggle for automatically saving changes to db, default False

store_file_data(data, autosave=False)

Load data to file table

Parameters
  • data – A list of dictionaries or a Pandas DataFrame

  • autosave – A toggle for automatically saving changes to db, default False

update_file_table_for_file_path(file_path, tag, value, autosave=False)

A method for updating file table

Parameters
  • file_path – A file_path for database look up

  • tag – A keyword for file column name

  • value – A new value for the file column

  • autosave – Toggle autosave, default off

Sequencing run adaptor

class igf_data.igfdb.seqrunadaptor.SeqrunAdaptor(**data)

An adaptor class for table Seqrun

divide_data_to_table_and_attribute(data, required_column='seqrun_igf_id', table_columns=None, attribute_name_column='attribute_name', attribute_value_column='attribute_value')

A method for separating data for Seqrun and Seqrun_attribute tables

Parameters
  • data – A list of dictionaries or a pandas dataframe

  • table_columns – List of table column names, default None

  • required_column – column name to add to the attribute data

  • attribute_name_column – label for attribute name column

  • attribute_value_column – label for attribute value column

Returns

Two pandas dataframes, one for Seqrun and another for Seqrun_attribute table

fetch_flowcell_barcode_rules_for_seqrun(seqrun_igf_id, flowcell_label='flowcell')

A method for fetching flowcell barcode rule for Seqrun

Parameters

seqrun_igf_id – A seqrun igf id

fetch_seqrun_records_igf_id(seqrun_igf_id, target_column_name='seqrun_igf_id')

A method for fetching data for Seqrun table

Parameters
  • seqrun_igf_id – an igf id

  • target_column_name – a column name in the Seqrun table, default seqrun_igf_id

store_seqrun_and_attribute_data(data, autosave=True)

A method for dividing and storing data to seqrun and attribute table

store_seqrun_attributes(data, seqrun_id='', autosave=False)

A method for storing data to Seqrun_attribute table

store_seqrun_data(data, autosave=False)

Load data to Seqrun table

store_seqrun_stats_data(data, seqrun_id='', autosave=True)

A method for storing data to seqrun_stats table

Platform adaptor

class igf_data.igfdb.platformadaptor.PlatformAdaptor(**data)

An adaptor class for Platform tables

fetch_platform_records_igf_id(platform_igf_id, target_column_name='platform_igf_id', output_mode='one')

A method for fetching data for Platform table

Parameters
  • platform_igf_id – an igf id

  • target_column_name – column name in the Platform table, default is platform_igf_id

store_flowcell_barcode_rule(data, autosave=True)

Load data to flowcell_barcode_rule table

Parameters

data – A dictionary or dataframe containing the following columns

  • platform_igf_id / platform_id

  • flowcell_type

  • index_1 (NO_CHANGE/REVCOMP/UNKNOWN)

  • index_2 (NO_CHANGE/REVCOMP/UNKNOWN)
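The expected payload can be sketched as below; the platform ids and flowcell types are hypothetical, and only the allowed index values from the list above are used.

```python
# hypothetical rows for store_flowcell_barcode_rule
barcode_rules = [
    {'platform_igf_id': 'M00001', 'flowcell_type': 'MISEQ',
     'index_1': 'NO_CHANGE', 'index_2': 'NO_CHANGE'},
    {'platform_igf_id': 'K00002', 'flowcell_type': 'HISEQ4000',
     'index_1': 'NO_CHANGE', 'index_2': 'REVCOMP'},
]

# index columns only accept the three listed values
allowed_index_values = {'NO_CHANGE', 'REVCOMP', 'UNKNOWN'}
for row in barcode_rules:
    assert row['index_1'] in allowed_index_values
    assert row['index_2'] in allowed_index_values
```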

store_platform_data(data, autosave=True)

Load data to Platform table

Pipeline adaptor

class igf_data.igfdb.pipelineadaptor.PipelineAdaptor(**data)

An adaptor class for Pipeline and Pipeline_seed tables

create_pipeline_seed(data, autosave=True, status_column='status', seeded_label='SEEDED', required_columns=('pipeline_id', 'seed_id', 'seed_table'))

A method for creating a new entry in the pipeline_seed table

Parameters

data – A dataframe or hash, it should contain the following fields

  • pipeline_name / pipeline_id

  • seed_id

  • seed_table

fetch_pipeline_records_pipeline_name(pipeline_name, target_column_name='pipeline_name')

A method for fetching data for Pipeline table

Parameters
  • pipeline_name – a name

  • target_column_name – default pipeline_name

fetch_pipeline_seed(pipeline_id, seed_id, seed_table, target_column_name=('pipeline_id', 'seed_id', 'seed_table'))

A method for fetching unique pipeline seed using pipeline_id, seed_id and seed_table

Parameters
  • pipeline_id – A pipeline db id

  • seed_id – A seed entry db id

  • seed_table – A seed table name

  • target_column_name – Target set of columns

fetch_pipeline_seed_with_table_data(pipeline_name, table_name='seqrun', status='SEEDED')

A method for fetching linked table records for the seeded entries in pipeseed table

Parameters
  • pipeline_name – A pipeline name

  • table_name – A table name for pipeline_seed lookup, default seqrun

  • status – A text label for seeded status, default is SEEDED

Returns

Two pandas dataframes for pipeline_seed entries and data from other tables

seed_new_experiments(pipeline_name, species_name_list, fastq_type, project_list=None, library_source_list=None, active_status='ACTIVE', autosave=True, seed_table='experiment')

A method for seeding new experiments for primary analysis

Parameters
  • pipeline_name – Name of the analysis pipeline

  • project_list – List of projects to consider for seeding analysis pipeline, default None

  • library_source_list – List of library source to consider for analysis, default None

  • species_name_list – List of sample species to consider for seeding analysis pipeline

  • active_status – Label for active status, default ACTIVE

  • autosave – A toggle for autosaving records in database, default True

  • seed_table – Seed table for pipeseed table, default experiment

Returns

A list of available projects for seeding analysis table (if project_list is None) or None and a list of seeded experiments or None

seed_new_seqruns(pipeline_name, autosave=True, seed_table='seqrun')

A method for creating seed for new seqruns

Parameters

pipeline_name – A pipeline name

store_pipeline_data(data, autosave=True)

Load data to Pipeline table

update_pipeline_seed(data, autosave=True, required_columns=('pipeline_id', 'seed_id', 'seed_table', 'status'))

A method for updating the seed status in pipeline_seed table

Parameters

data – A dataframe or a hash, should contain the following fields

  • pipeline_name / pipeline_id

  • seed_id

  • seed_table

  • status

Utility functions for database access

Database utility functions

igf_data.utils.dbutils.clean_and_rebuild_database(dbconfig)

A method for deleting data in database and creating empty tables

Parameters

dbconfig – A json file containing the database connection info

igf_data.utils.dbutils.read_dbconf_json(dbconfig)

A method for reading dbconfig json file

Parameters

dbconfig – A json file containing the database connection info, e.g. {"dbhost": "DBHOST", "dbport": PORT, "dbuser": "USER", "dbpass": "DBPASS", "dbname": "DBNAME", "driver": "mysql", "connector": "pymysql"}

Returns

A dictionary containing the db connection parameters
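The returned parameters can be combined into an SQLAlchemy-style connection URL; this sketch assumes the mysql-style key names from the example above and mimics the json round trip that read_dbconf_json performs.

```python
import json
import tempfile

def dbparams_to_url(dbparams):
    """Build an SQLAlchemy-style connection URL from the dbconfig
    dictionary; assumes the mysql-style key layout shown above."""
    return ('{driver}+{connector}://{dbuser}:{dbpass}'
            '@{dbhost}:{dbport}/{dbname}').format(**dbparams)

# dump a minimal config to a json file and read it back
conf = {"dbhost": "localhost", "dbport": 3306, "dbuser": "igf",
        "dbpass": "secret", "dbname": "igfdb",
        "driver": "mysql", "connector": "pymysql"}
with tempfile.NamedTemporaryFile('w+', suffix='.json') as fp:
    json.dump(conf, fp)
    fp.seek(0)
    dbparams = json.load(fp)

url = dbparams_to_url(dbparams)
# url -> 'mysql+pymysql://igf:secret@localhost:3306/igfdb'
```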

igf_data.utils.dbutils.read_json_data(data_file)

A method for reading data from json file

Parameters

data_file – A Json format file

Returns

A list of dictionaries

Project adaptor utility functions

igf_data.utils.projectutils.draft_email_for_project_cleanup(template_file, data, draft_output)

A method for drafting email for cleanup

Parameters
  • template_file – A template file

  • data

    A list of dictionaries or a dictionary containing the following columns

    • name

    • email_id

    • projects

    • cleanup_date

  • draft_output – An output filename

igf_data.utils.projectutils.find_projects_for_cleanup(dbconfig_file, warning_note_weeks=24, all_warning_note=False)

A function for finding old projects for cleanup

Parameters
  • dbconfig_file – A dbconfig file path

  • warning_note_weeks – Number of weeks from last sequencing run to wait before sending warnings, default 24

  • all_warning_note – A toggle for sending warning notes to all, default False

Returns

Three lists: a warning note list, a final note list and a cleanup list

igf_data.utils.projectutils.get_files_and_irods_path_for_project(project_igf_id, db_session_class, irods_path_prefix='/igfZone/home/')

A function for listing all the files and irods dir path for a given project

Parameters
  • project_igf_id – A string containing the project igf id

  • db_session_class – A database session object

  • irods_path_prefix – A string containing irods path prefix, default ‘/igfZone/home/’

Returns

A list containing all the files for a project and a string containing the irods path for the project

igf_data.utils.projectutils.get_project_read_count(project_igf_id, session_class, run_attribute_name='R1_READ_COUNT', active_status='ACTIVE')

A utility method for fetching sample read counts for an input project_igf_id

Parameters
  • project_igf_id – A project_igf_id string

  • session_class – A db session class object

  • run_attribute_name – Attribute name from Run_attribute table for read count lookup

  • active_status – text label for active runs, default ACTIVE

Returns

A pandas dataframe containing following columns

  • project_igf_id

  • sample_igf_id

  • flowcell_id

  • attribute_value
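A dataframe shaped like the description above can be aggregated per sample, e.g. to total read counts across flowcells; the rows below are hypothetical.

```python
import pandas as pd

# hypothetical rows shaped like the returned dataframe
read_counts = pd.DataFrame([
    {'project_igf_id': 'IGF_PRJ_1', 'sample_igf_id': 'IGF_SMP_1',
     'flowcell_id': 'FC001', 'attribute_value': 1000},
    {'project_igf_id': 'IGF_PRJ_1', 'sample_igf_id': 'IGF_SMP_1',
     'flowcell_id': 'FC002', 'attribute_value': 500},
])

# total R1 reads per sample across all flowcells
total_per_sample = \
    read_counts.groupby('sample_igf_id')['attribute_value'].sum()
# total_per_sample['IGF_SMP_1'] -> 1500
```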

igf_data.utils.projectutils.get_seqrun_info_for_project(project_igf_id, session_class)

A utility method for fetching seqrun_igf_id and flowcell_id which are linked to a specific project_igf_id

Parameters
  • project_igf_id – A project_igf_id string

  • session_class – A db session class object

Returns

A pandas dataframe containing following columns

  • seqrun_igf_id

  • flowcell_id

igf_data.utils.projectutils.mark_project_and_list_files_for_cleanup(project_igf_id, dbconfig_file, outout_dir, force_overwrite=True, use_ephemeral_space=False, irods_path_prefix='/igfZone/home/', withdrawn_tag='WITHDRAWN')

A wrapper function for project cleanup operation

Parameters
  • project_igf_id – A string containing the project igf id

  • dbconfig_file – A dbconf json file path

  • outout_dir – Output dir path for dumping file lists for project

  • force_overwrite – Overwrite existing output file, default True

  • use_ephemeral_space – A toggle for temp dir, default False

  • irods_path_prefix – Prefix for irods path, default /igfZone/home/

  • withdrawn_tag – A string tag for marking files in db, default WITHDRAWN

Returns

None

igf_data.utils.projectutils.mark_project_as_withdrawn(project_igf_id, db_session_class, withdrawn_tag='WITHDRAWN')

A function for marking all the entries for a specific project as withdrawn

Parameters
  • project_igf_id – A string containing the project igf id

  • db_session_class – A dbsession object

  • withdrawn_tag – A string for withdrawn field in db, default WITHDRAWN

Returns

None

igf_data.utils.projectutils.mark_project_barcode_check_off(project_igf_id, session_class, barcode_check_attribute='barcode_check', barcode_check_val='OFF')

A utility method for marking project barcode check as off using the project_igf_id

Parameters
  • project_igf_id – A project_igf_id string

  • session_class – A db session class object

  • barcode_check_attribute – A text keyword for barcode check attribute, default barcode_check

  • barcode_check_val – A text for barcode check attribute value, default is ‘OFF’

Returns

None

igf_data.utils.projectutils.notify_project_for_cleanup(warning_template, final_notice_template, cleanup_template, warning_note_list, final_note_list, cleanup_list, use_ephemeral_space=False)

A function for sending emails to users for project cleanup

Parameters
  • warning_template – A email template file for warning

  • final_notice_template – A email template for final notice

  • cleanup_template – A email template for sending cleanup list to igf

  • warning_note_list

    A list of dictionaries containing the following fields to warn users about cleanup

    • name

    • email_id

    • projects

    • cleanup_date

  • final_note_list – A list of dictionaries containing the above mentioned fields to notify users about the final cleanup

  • cleanup_list – A list of dictionaries containing the above mentioned fields to list projects for cleanup

  • use_ephemeral_space – A toggle for using the ephemeral space, default False

igf_data.utils.projectutils.send_email_to_user_via_sendmail(draft_email_file, waiting_time=20, sendmail_exe='sendmail', dry_run=False)

A function for sending email to users via sendmail

Parameters
  • draft_email_file – A draft email to be sent to user

  • waiting_time – Wait after sending the email, default 20sec

  • sendmail_exe – Sendmail exe path, default sendmail

  • dry_run – A toggle for dry run, default False

Sequencing adaptor utility functions

igf_data.utils.seqrunutils.get_seqrun_date_from_igf_id(seqrun_igf_id)

A utility method for fetching sequence run date from the igf id

Parameters

seqrun_igf_id – A seqrun igf id string

Returns

A string value of the date
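Assuming the seqrun igf id begins with an Illumina-style YYMMDD date stamp (the example id below is hypothetical), the date extraction can be sketched as:

```python
from datetime import datetime

def seqrun_date_from_igf_id(seqrun_igf_id):
    """Parse the date token, assuming the id starts with a YYMMDD
    prefix as in Illumina run folder names."""
    date_token = seqrun_igf_id.split('_')[0]
    return datetime.strptime(date_token, '%y%m%d').date()

run_date = seqrun_date_from_igf_id('171003_M00001_0089_000000000-TEST')
# run_date -> datetime.date(2017, 10, 3)
```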

igf_data.utils.seqrunutils.load_new_seqrun_data(data_file, dbconfig)

A method for loading new data for seqrun table

Pipeline adaptor utility functions

igf_data.utils.pipelineutils.find_new_analysis_seeds(dbconfig_path, pipeline_name, project_name_file, species_name_list, fastq_type, library_source_list)

A utils method for finding and seeding new experiments for analysis

Parameters
  • dbconfig_path – A database configuration file

  • slack_config – A slack configuration file

  • pipeline_name – Pipeline name

  • fastq_type – Fastq collection type

  • project_name_file – A file containing the list of projects for seeding pipeline

  • species_name_list – A list of species to consider for seeding analysis

  • library_source_list – A list of library source info to consider for seeding analysis

Returns

List of available experiments or None, and a list of seeded experiments or None

igf_data.utils.pipelineutils.load_new_pipeline_data(data_file, dbconfig)

A method for loading new data for pipeline table

Platform adaptor utility functions

igf_data.utils.platformutils.load_new_flowcell_data(data_file, dbconfig)

A method for loading new data to flowcell table

igf_data.utils.platformutils.load_new_platform_data(data_file, dbconfig)

A method for loading new data for platform table

Pipeline seed adaptor utility functions

igf_data.utils.ehive_utils.pipeseedfactory_utils.get_pipeline_seeds(pipeseed_mode, pipeline_name, igf_session_class, seed_id_label='seed_id', seqrun_date_label='seqrun_date', seqrun_id_label='seqrun_id', experiment_id_label='experiment_id', seqrun_igf_id_label='seqrun_igf_id')

A utils function for fetching pipeline seed information

Parameters
  • pipeseed_mode – A string info about pipeseed mode, allowed values are demultiplexing and alignment

  • pipeline_name – A string containing the pipeline name

  • igf_session_class – A database session class for pipeline seed lookup

Returns

Two Pandas dataframes, first with pipeseed entries and second with seed info

IGF pipeline api

Pipeline api

Fetch fastq files for analysis

igf_data.utils.analysis_fastq_fetch_utils.get_fastq_input_list(db_session_class, experiment_igf_id, combine_fastq_dir=False, fastq_collection_type='demultiplexed_fastq', active_status='ACTIVE')

A function for fetching all the fastq files linked to a specific experiment id

Parameters
  • db_session_class – A database session class

  • experiment_igf_id – An experiment igf id

  • fastq_collection_type – Fastq collection type name, default demultiplexed_fastq

  • active_status – text label for active runs, default ACTIVE

  • combine_fastq_dir – Combine fastq file directories for output line, default False

Returns

A list of fastq file or fastq dir paths for the analysis run

Raises

ValueError – It raises ValueError if no fastq directory found

Load analysis result to database and file system

class igf_data.utils.analysis_collection_utils.Analysis_collection_utils(dbsession_class, base_path=None, collection_name=None, collection_type=None, collection_table=None, rename_file=True, add_datestamp=True, tag_name=None, analysis_name=None, allowed_collection=('sample', 'experiment', 'run', 'project'))

A class for dealing with analysis file collections. It has specific methods for moving analysis files to a specific directory structure and renaming the files using a uniform rule, if required, e.g. ‘<collection_name>_<analysis_name>_<tag>_<datestamp>.<original_suffix>’

Parameters
  • dbsession_class – A database session class

  • collection_name – Collection name information for file, default None

  • collection_type – Collection type information for file, default None

  • collection_table – Collection table information for file, default None

  • base_path – A base filepath to move file while loading, default ‘None’ for no file move

  • rename_file – Rename file based on collection_table type while loading, default True

  • add_datestamp – Add datestamp while loading the file

  • analysis_name – Analysis name for the file, required for renaming while loading, default None

  • tag_name – Additional tag for filename,default None

  • allowed_collection

    List of allowed collection tables

    sample, experiment, run, project
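The renaming rule quoted in the class description can be sketched as below; the helper name and the exact token order and datestamp format are assumptions based on that docstring, not the class's actual implementation.

```python
import os
from datetime import date

def build_new_file_name(input_file, collection_name, analysis_name,
                        tag_name=None, add_datestamp=True):
    """Compose '<collection_name>_<analysis_name>_<tag>_<datestamp>
    .<original_suffix>', assuming the file name contains a dot."""
    # keep multi-part suffixes such as vcf.gz intact
    suffix = os.path.basename(input_file).split('.', 1)[1]
    parts = [collection_name, analysis_name]
    if tag_name:
        parts.append(tag_name)
    if add_datestamp:
        parts.append(date.today().strftime('%d_%m_%Y'))  # assumed format
    return '{0}.{1}'.format('_'.join(parts), suffix)

new_name = build_new_file_name(
    '/tmp/input.vcf.gz', 'IGF_SMP_1', 'variant_calling',
    tag_name='filtered', add_datestamp=False)
# new_name -> 'IGF_SMP_1_variant_calling_filtered.vcf.gz'
```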

create_or_update_analysis_collection(file_path, dbsession, withdraw_exisitng_collection=True, autosave_db=True, force=True, remove_file=False)

A method for creating or updating an analysis file collection in db. Required elements will be collected from the database if the base_path element is given.

Parameters
  • file_path – file path to load as db collection

  • dbsession – An active database session

  • withdraw_exisitng_collection – Remove existing collection group

  • autosave_db – Save changes to database, default True

  • remove_file – A toggle for removing existing file from disk, default False

  • force – Toggle for removing existing file collection, default True

get_new_file_name(input_file, file_suffix=None)

A method for fetching new file name

Parameters
  • input_file – An input filepath

  • file_suffix – A file suffix

load_file_to_disk_and_db(input_file_list, withdraw_exisitng_collection=True, autosave_db=True, file_suffix=None, force=True, remove_file=False)

A method for loading analysis results to disk and database. File will be moved to a new path if base_path is present. Directory structure of the final path is based on the collection_table information.

Following will be the final directory structure if base_path is present

  • project – base_path/project_igf_id/analysis_name

  • sample – base_path/project_igf_id/sample_igf_id/analysis_name

  • experiment – base_path/project_igf_id/sample_igf_id/experiment_igf_id/analysis_name

  • run – base_path/project_igf_id/sample_igf_id/experiment_igf_id/run_igf_id/analysis_name

Parameters
  • input_file_list – A list of input files to load, all using the same collection info

  • withdraw_exisitng_collection – Remove existing collection group, DO NOT use this while loading a list of files

  • autosave_db – Save changes to database, default True

  • file_suffix – Use a specific file suffix, use None if it should be the same as the original file, e.g. input.vcf.gz to output.vcf.gz

  • force – Toggle for removing existing file, default True

  • remove_file – A toggle for removing existing file from disk, default False

Returns

A list of final filepaths

Run metadata validation checks

class igf_data.utils.validation_check.metadata_validation.Validate_project_and_samplesheet_metadata(samplesheet_file, metadata_files, samplesheet_schema, metadata_schema, samplesheet_name='SampleSheet.csv')

A class for running validation checks on project and samplesheet metadata files

Parameters
  • samplesheet_file – A samplesheet input file

  • metadata_files – A list of metadata input files

  • samplesheet_schema – A json schema for samplesheet file validation

  • metadata_schema – A json schema for metadata file validation

static check_metadata_library_by_row(data)

A static method for checking library type metadata per row

Parameters

data – A pandas data series containing sample metadata

Returns

An error message or None

compare_metadata()

A function for comparing samplesheet and metadata files

Returns

A list of error or an empty list

convert_errors_to_gviz(output_json=None)

A method for converting the list of errors to gviz format json

Parameters

output_json – An output json file for saving data, default None

Returns

A gviz json data block for the html output if output_json is None, or else None

dump_error_to_csv(output_csv)

A method for dumping the list of errors to a csv file

Returns

Output csv file path if any errors are found, or else None

get_merged_errors()

A method for running the validation checks on input samplesheet metadata and samplesheet files

Returns

A list of errors or an empty list

get_metadata_validation_report()

A method for running validation checks on input metadata files

Returns

A list of errors or an empty list

get_samplesheet_validation_report()

A method for running validation checks on input samplesheet file

Returns

A list of errors or an empty list

static validate_metadata_library_type(sample_id, library_source, library_strategy, experiment_type)

A staticmethod for validating library metadata information for sample

Parameters
  • sample_id – Sample name

  • library_source – Library source information

  • library_strategy – Library strategy information

  • experiment_type – Experiment type information

Returns

An error message string or None

Generic utility functions

Basic fasta sequence processing

igf_data.utils.sequtils.rev_comp(input_seq)

A function for converting nucleotide sequence to its reverse complement

Parameters

input_seq – A string of nucleotide sequence

Returns

Reverse complement version of the input sequence
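The behaviour described above can be sketched as follows; this is a minimal illustration of a reverse-complement helper, not the actual IGF implementation (which may handle additional characters or validation):

```python
# A minimal sketch of a reverse-complement helper, equivalent in spirit
# to igf_data.utils.sequtils.rev_comp; the real code may differ
def rev_comp(input_seq):
    """Return the reverse complement of a nucleotide sequence."""
    complement = str.maketrans('ACGTNacgtn', 'TGCANtgcan')
    # complement each base, then reverse the whole string
    return input_seq.translate(complement)[::-1]
```

For example, `rev_comp('ATGC')` yields `'GCAT'`.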

Advanced fastq file processing

igf_data.utils.fastq_utils.compare_fastq_files_read_counts(r1_file, r2_file)

A method for comparing read counts for fastq pairs

Parameters
  • r1_file – Fastq pair R1 file path

  • r2_file – Fastq pair R2 file path

Raises

ValueError if the counts are not the same

igf_data.utils.fastq_utils.count_fastq_lines(fastq_file)

A method for counting fastq lines

Parameters

fastq_file – A gzipped or unzipped fastq file

Returns

Fastq line count
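A gzip-aware line count like the one described can be sketched as below; this mirrors the documented behaviour but is not the IGF code itself:

```python
import gzip

# A sketch of counting lines in a gzipped or plain-text fastq file,
# matching the described behaviour of count_fastq_lines
def count_fastq_lines(fastq_file):
    """Count lines in a gzipped or unzipped fastq file."""
    opener = gzip.open if fastq_file.endswith('.gz') else open
    with opener(fastq_file, 'rt') as fp:
        return sum(1 for _ in fp)
```

Dividing the returned line count by four gives the read count for a well-formed fastq file.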

igf_data.utils.fastq_utils.detect_non_fastq_in_file_list(input_list)

A method for detecting non-fastq files within a list of input fastq files

Parameters

input_list – A list of filepath to check

Returns

True if non-fastq files are present, or else False

igf_data.utils.fastq_utils.identify_fastq_pair(input_list, sort_output=True, check_count=False)

A method for fastq read pair identification

Parameters
  • input_list – A list of input fastq files

  • sort_output – Sort output list, default True

  • check_count – Check read count for fastq pair, only available if sort_output is True, default False

Returns

A list for read1 files and another list of read2 files
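Pair identification of this kind is typically based on Illumina-style filename tags; the sketch below assumes the `_R1_`/`_R2_` convention and does not reproduce the real function's read-count check:

```python
import re

# A sketch of read-pair identification using Illumina-style _R1_/_R2_
# filename tags; identify_fastq_pair itself may apply different rules
def identify_fastq_pair(input_list, sort_output=True):
    """Split a list of fastq paths into read1 and read2 lists."""
    read1_list = [f for f in input_list if re.search(r'_R1_', f)]
    read2_list = [f for f in input_list if re.search(r'_R2_', f)]
    if sort_output:
        # sorting keeps R1 and R2 entries in matching order
        read1_list.sort()
        read2_list.sort()
    return read1_list, read2_list
```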

Process local and remote files

igf_data.utils.fileutils.calculate_file_checksum(filepath, hasher='md5')

A method for file checksum calculation

Parameters
  • filepath – a file path

  • hasher – default is md5, allowed: md5 or sha256

Returns

file checksum value
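A streaming checksum of this kind can be sketched with the standard hashlib module; the chunk size here is an illustrative choice, not an IGF default:

```python
import hashlib

# A sketch of streaming md5/sha256 checksum calculation, mirroring the
# documented interface of calculate_file_checksum
def calculate_file_checksum(filepath, hasher='md5'):
    if hasher not in ('md5', 'sha256'):
        raise ValueError('hasher {0} is not supported'.format(hasher))
    digest = hashlib.new(hasher)
    with open(filepath, 'rb') as fp:
        # read in 8 KB blocks to avoid loading large files into memory
        for chunk in iter(lambda: fp.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()
```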

igf_data.utils.fileutils.check_file_path(file_path)

A function for checking existing filepath

Parameters

file_path – An input filepath for check

Raises

IOError – It raises IOError if file not found

igf_data.utils.fileutils.copy_local_file(source_path, destinationa_path, cd_to_dest=True, force=False)

A method for copying files on local disk

Parameters
  • source_path – A source file path

  • destinationa_path – A destination file path, including the file name

  • cd_to_dest – Change to destination dir before copy, default True

  • force – Optional, set True to overwrite existing destination file, default is False

igf_data.utils.fileutils.copy_remote_file(source_path, destinationa_path, source_address=None, destination_address=None, copy_method='rsync', check_file=True, force_update=False, exclude_pattern_list=None)

A method for copying files from or to a remote location

Parameters
  • source_path – A source file path

  • destination_path – A destination file path

  • source_address – Address of the source server

  • destination_address – Address of the destination server

  • copy_method – A method for copying files, default is ‘rsync’

  • check_file – Check file after transfer using checksum, default True

  • force_update – Overwrite existing file or dir, default is False

  • exclude_pattern_list – List of file patterns to exclude, default None

igf_data.utils.fileutils.create_file_manifest_for_dir(results_dirpath, output_file, md5_label='md5', size_lavel='size', path_label='file_path', exclude_list=None, force=True)

A method for creating md5 and size list for all the files in a directory path

Parameters
  • results_dirpath – A file path for input file directory

  • output_file – Name of the output csv filepath

  • exclude_list – A list of file pattern to exclude from the archive, default None

  • force – A toggle for replacing the output file if it's already present, default True

  • md5_label – A string for checksum column, default md5

  • size_lavel – A string for file size column, default size

  • path_label – A string for file path column, default file_path

Returns

None

igf_data.utils.fileutils.get_datestamp_label(datetime_str=None)

A method for fetching datestamp

Parameters

datetime_str – A datetime string to parse, default None

Returns

A padded string of format YYYYMMDD
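The YYYYMMDD formatting can be sketched as below; the real function likely accepts a wider range of input datetime formats than the single ISO form assumed here:

```python
from datetime import datetime

# A sketch of datestamp formatting; assumes a YYYY-MM-DD input string,
# which is narrower than what get_datestamp_label may actually accept
def get_datestamp_label(datetime_str=None):
    """Return a padded YYYYMMDD string for datetime_str, or for now."""
    dt = datetime.strptime(datetime_str, '%Y-%m-%d') \
        if datetime_str else datetime.now()
    return dt.strftime('%Y%m%d')
```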

igf_data.utils.fileutils.get_file_extension(input_file)

A method for extracting file suffix information

Parameters

input_file – A filepath for getting suffix

Returns

A suffix string or an empty string if no suffix found

igf_data.utils.fileutils.get_temp_dir(work_dir=None, prefix='temp', use_ephemeral_space=False)

A function for creating temp directory

Parameters
  • work_dir – A path for work directory, default None

  • prefix – A prefix for directory path, default ‘temp’

  • use_ephemeral_space – Use env variable $EPHEMERAL to get work directory, default False

Returns

A temp_dir

igf_data.utils.fileutils.list_remote_file_or_dirs(remote_server, remote_path, only_dirs=True, only_files=False, user_name=None, user_pass=None)

A method for listing dirs or files on the remote dir paths

Parameters
  • remote_server – Remote server address

  • remote_path – Path on remote server

  • only_dirs – Toggle for listing only dirs, default True

  • only_files – Toggle for listing only files, default False

  • user_name – User name, default None

  • user_pass – User password, default None

Returns

A list of dir or file paths

igf_data.utils.fileutils.move_file(source_path, destinationa_path, force=False)

A method for moving files to local disk

Parameters
  • source_path – A source file path

  • destination_path – A destination file path, including the file name

  • force – Optional, set True to overwrite existing destination file, default is False

igf_data.utils.fileutils.prepare_file_archive(results_dirpath, output_file, gzip_output=True, exclude_list=None, force=True)

A method for creating tar.gz archive with the files present in filepath

Parameters
  • results_dirpath – A file path for input file directory

  • output_file – Name of the output archive filepath

  • gzip_output – A toggle for creating gzip output tarfile, default True

  • exclude_list – A list of file pattern to exclude from the archive, default None

  • force – A toggle for replacing the output file if it's already present, default True

Returns

None

igf_data.utils.fileutils.preprocess_path_name(input_path)

A method for processing a filepath. It takes a file path or dirpath and returns the same path after removing any whitespace or ascii symbols from the input.

Parameters

input_path – An input file path or directory path

Returns

A reformatted filepath or dirpath

igf_data.utils.fileutils.remove_dir(dir_path, ignore_errors=True)

A function for removing directory containing files

Parameters
  • dir_path – A directory path

  • ignore_errors – Ignore errors while removing dir, default True

Load files to irods server

class igf_data.utils.igf_irods_client.IGF_irods_uploader(irods_exe_dir, host='eliot.med.ic.ac.uk', zone='/igfZone', port=1247, igf_user='igf', irods_resource='woolfResc')

A simple wrapper for uploading files to an irods server from the HPC cluster CX1. Please run the following commands on the HPC cluster before running this module:

  • Add irods settings to ~/.irods/irods_environment.json

    > module load irods/4.2.0
    > iinit (optional username)

  • Authenticate the irods settings using your password. The above command will generate a file containing your iRODS password in a ‘scrambled form’.

Parameters

irods_exe_dir – A path to the bin directory where icommands are installed

upload_analysis_results_and_create_collection(file_list, irods_user, project_name, analysis_name='default', dir_path_list=None, file_tag=None)

A method for uploading analysis files to irods server

Parameters
  • file_list – A list of file paths to upload to irods

  • irods_user – Irods user name

  • project_name – Name of the project

  • analysis_name – A string for analysis name, default is ‘default’

  • dir_path_list – A list of directory structure for the irods server, default None for using datestamp

  • file_tag – A text string for adding tag to collection, default None for only project_name

upload_fastqfile_and_create_collection(filepath, irods_user, project_name, run_igf_id, run_date, flowcell_id=None, data_type='fastq')

A method for uploading files to irods server and creating collections with metadata

Parameters
  • filepath – A file for upload to iRODS server

  • irods_user – Recipient user’s irods username

  • project_name – Name of the project. This will be used for the collection tag

  • run_igf_id – A unique igf id, either seqrun or run or experiment

  • run_date – A unique run date

  • data_type – A directory label, e.g. fastq, bam or cram

Calculate storage statistics

igf_data.utils.disk_usage_utils.get_storage_stats_in_gb(storage_list)

A utility function for fetching disk usage stats (df -h) and returning disk usage in GB

Parameters

storage_list – an input list of storage paths

Returns

A list of dictionaries containing the following keys: storage, used, available
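The conversion from raw `df` output into the dictionaries above can be sketched as follows; this parses POSIX `df -Pk` text (KB blocks) rather than calling `df` itself, and the helper name is illustrative, not part of the IGF API:

```python
# A sketch of converting POSIX `df -Pk` output (1024-byte blocks) into
# storage/used/available dictionaries in GB; the real function runs df
# itself and may parse its output differently
def parse_df_output_in_gb(df_text):
    stats = []
    for line in df_text.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        stats.append({
            'storage': fields[5],                   # mount point
            'used': round(int(fields[2]) / (1024 ** 2), 2),
            'available': round(int(fields[3]) / (1024 ** 2), 2)})
    return stats
```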

igf_data.utils.disk_usage_utils.get_sub_directory_size_in_gb(input_path, dir_name_col='directory_name', dir_size_col='directory_size')

A utility function for listing disk size of all sub-directories for a given path (similar to linux command du -sh /path/* )

Parameters
  • input_path – an input file path

  • dir_name_col – column name for directory name, default directory_name

  • dir_size_col – column name for directory size, default directory_size

Returns

  • a list of dictionaries containing the following keys: directory_name, directory_size

  • a description dictionary for gviz_api

  • a column order list for gviz_api

igf_data.utils.disk_usage_utils.merge_storage_stats_json(config_file, label_file=None, server_name_col='server_name', storage_col='storage', used_col='used', available_col='available', disk_usage_col='disk_usage')

A utility function for merging multiple disk usage stats files generated by json dumps of get_storage_stats_in_gb output

Parameters
  • config_file

    a disk usage status config json file with the following keys: server_name, disk_usage

    Each of the disk usage json files should have the following keys: storage, used, available

  • label_file – an optional json file for renaming the raw disk names format: <raw name> : <print name>

Returns

  • merged data as a list of dictionaries

  • a dictionary containing the description for the gviz_data

  • a list of column order

Run analysis tools

Process fastqc output file

igf_data.utils.fastqc_utils.get_fastq_info_from_fastq_zip(fastqc_zip, fastqc_datafile='*/fastqc_data.txt')

A function for retrieving the total reads and fastq file name from a fastqc_zip file

Parameters
  • fastqc_zip – A zip file containing fastqc results

  • fastqc_datafile – A filename pattern for the fastqc data file lookup, default */fastqc_data.txt

Returns

Total read count and fastq filename
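Pulling these two values out of a fastqc zip can be sketched with the standard zipfile module; the field names below follow the fastqc_data.txt format, and this is not the IGF implementation itself:

```python
import fnmatch
import zipfile

# A sketch of extracting total reads and filename from a fastqc zip,
# mirroring the documented interface of get_fastq_info_from_fastq_zip
def get_fastq_info_from_fastq_zip(fastqc_zip, fastqc_datafile='*/fastqc_data.txt'):
    with zipfile.ZipFile(fastqc_zip) as archive:
        datafile = [name for name in archive.namelist()
                    if fnmatch.fnmatch(name, fastqc_datafile)][0]
        total_reads = fastq_filename = None
        # fastqc_data.txt stores key/value pairs as tab-separated lines
        for raw_line in archive.read(datafile).decode('utf-8').splitlines():
            if raw_line.startswith('Filename'):
                fastq_filename = raw_line.split('\t')[1]
            if raw_line.startswith('Total Sequences'):
                total_reads = raw_line.split('\t')[1]
    return total_reads, fastq_filename
```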

Cellranger count utils

igf_data.utils.tools.cellranger.cellranger_count_utils.check_cellranger_count_output(output_path, file_list=('web_summary.html', 'metrics_summary.csv', 'possorted_genome_bam.bam', 'possorted_genome_bam.bam.bai', 'filtered_feature_bc_matrix.h5', 'raw_feature_bc_matrix.h5', 'molecule_info.h5', 'cloupe.cloupe', 'analysis/tsne/2_components/projection.csv', 'analysis/clustering/graphclust/clusters.csv', 'analysis/diffexp/kmeans_3_clusters/differential_expression.csv', 'analysis/pca/10_components/variance.csv'))

A function for checking cellranger count output

Parameters
  • output_path – A filepath for cellranger count output directory

  • file_list

    List of files to check in the output directory

    default file list to check

    • web_summary.html

    • metrics_summary.csv

    • possorted_genome_bam.bam

    • possorted_genome_bam.bam.bai

    • filtered_feature_bc_matrix.h5

    • raw_feature_bc_matrix.h5

    • molecule_info.h5

    • cloupe.cloupe

    • analysis/tsne/2_components/projection.csv

    • analysis/clustering/graphclust/clusters.csv

    • analysis/diffexp/kmeans_3_clusters/differential_expression.csv

    • analysis/pca/10_components/variance.csv

Returns

None

Raises

IOError – when any file is missing from the output path

igf_data.utils.tools.cellranger.cellranger_count_utils.extract_cellranger_count_metrics_summary(cellranger_tar, collection_name=None, collection_type=None, attribute_name='attribute_name', attribute_value='attribute_value', attribute_prefix='None', target_filename='metrics_summary.csv')

A function for extracting the metrics summary file from a cellranger output tar and parsing it. Optionally, it can add the collection name and type info to the output dictionary.

Parameters
  • cellranger_tar – A cellranger output tar file

  • target_filename – A filename for metrics summary file lookup, default metrics_summary.csv

  • collection_name – Optional collection name, default None

  • collection_type – Optional collection type, default None

  • attribute_prefix – An optional string to add as a prefix to the attribute names, default None

Returns

A dictionary containing the metrics values

igf_data.utils.tools.cellranger.cellranger_count_utils.get_cellranger_count_input_list(db_session_class, experiment_igf_id, fastq_collection_type='demultiplexed_fastq', active_status='ACTIVE')

A function for fetching input list for cellranger count run for a specific experiment

Parameters
  • db_session_class – A database session class

  • experiment_igf_id – An experiment igf id

  • fastq_collection_type – Fastq collection type name, default demultiplexed_fastq

  • active_status – text label for active runs, default ACTIVE

Returns

A list of fastq dir path for the cellranger count run

Raises

ValueError – It raises ValueError if no fastq directory found

BWA utils

class igf_data.utils.tools.bwa_utils.BWA_util(bwa_exe, samtools_exe, ref_genome, input_fastq_list, output_dir, output_prefix, bam_output=True, thread=1, use_ephemeral_space=0)

Pipeline utils class for running BWA

Parameters
  • bwa_exe – BWA executable path

  • samtools_exe – Samtools executable path

  • ref_genome – Reference genome index for BWA run

  • input_fastq_list – List of input fastq files for alignment

  • output_dir – Output directory path

  • output_prefix – Output prefix for alignment

  • bam_output – A toggle for writing bam output, default True

  • thread – No. of threads for BWA run, default 1

  • use_ephemeral_space – A toggle for temp dir settings, default 0

run_mem(mem_cmd='mem', parameter_options=('-M', ''), samtools_cmd='view', dry_run=False)

A method for running bwa mem and generating the output alignment

Parameters
  • mem_cmd – Bwa mem command, default mem

  • parameter_options – List of bwa mem options, default -M

  • samtools_cmd – Samtools view command, default view

  • dry_run – A toggle for returning the bwa cmd without running it, default False

Returns

An alignment file path and the bwa run cmd
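The kind of command list a dry run of run_mem could return can be sketched as below; the helper name and option ordering are illustrative, not IGF defaults:

```python
# A sketch of assembling a bwa mem command list, as a dry run of
# run_mem might return; build_bwa_mem_cmd is a hypothetical helper
def build_bwa_mem_cmd(bwa_exe, ref_genome, input_fastq_list,
                      parameter_options=('-M',), thread=1):
    cmd = [bwa_exe, 'mem', '-t', str(thread)]
    cmd.extend(parameter_options)   # e.g. -M for Picard compatibility
    cmd.append(ref_genome)          # reference index prefix
    cmd.extend(input_fastq_list)    # R1 (and R2 for paired-end)
    return cmd
```

In the real class the output of bwa mem is piped through samtools view to produce the bam output.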

Picard utils

class igf_data.utils.tools.picard_util.Picard_tools(java_exe, picard_jar, input_files, output_dir, ref_fasta, picard_option=None, java_param='-Xmx4g', strand_info='NONE', threads=1, output_prefix=None, use_ephemeral_space=0, ref_flat_file=None, ribisomal_interval=None, patterned_flowcell=False, suported_commands=('CollectAlignmentSummaryMetrics', 'CollectGcBiasMetrics', 'QualityScoreDistribution', 'CollectRnaSeqMetrics', 'CollectBaseDistributionByCycle', 'MarkDuplicates', 'AddOrReplaceReadGroups'))

A class for running picard tool

Parameters
  • java_exe – Java executable path

  • picard_jar – Picard path

  • input_files – Input bam filepaths list

  • output_dir – Output directory filepath

  • ref_fasta – Input reference fasta filepath

  • picard_option – Additional picard run parameters as dictionary, default None

  • java_param – Java parameter, default ‘-Xmx4g’

  • strand_info – RNA-Seq strand information, default NONE

  • ref_flat_file – Input ref_flat file path, default None

  • output_prefix – Output prefix name, default None

  • threads – Number of threads to run for java, default 1

  • use_ephemeral_space – A toggle for temp dir setting, default 0

  • patterned_flowcell – Toggle for marking the patterned flowcell, default False

  • suported_commands

    A list of supported picard commands

    • CollectAlignmentSummaryMetrics

    • CollectGcBiasMetrics

    • QualityScoreDistribution

    • CollectRnaSeqMetrics

    • CollectBaseDistributionByCycle

    • MarkDuplicates

    • AddOrReplaceReadGroups

run_picard_command(command_name, dry_run=False)

A method for running generic picard command

Parameters
  • command_name – Picard command name

  • dry_run – A toggle for returning picard command without the actual run, default False

Returns

A list of output files from picard run and picard run command and optional picard metrics

Fastp utils

class igf_data.utils.tools.fastp_utils.Fastp_utils(fastp_exe, input_fastq_list, output_dir, run_thread=1, enable_polyg_trim=False, split_by_lines_count=5000000, log_output_prefix=None, use_ephemeral_space=0, fastp_options_list=('-a', 'auto', '--qualified_quality_phred=15', '--length_required=15'))

A class for running fastp tool for a list of input fastq files

Parameters
  • fastp_exe – A fastp executable path

  • input_fastq_list – A list of input files

  • output_dir – An output directory path

  • split_by_lines_count – Number of entries for splitted fastq files, default 5000000

  • run_thread – Number of threads to use, default 1

  • enable_polyg_trim – Enable poly G trim for NextSeq and NovaSeq, default False

  • log_output_prefix – Output prefix for log file, default None

  • use_ephemeral_space – A toggle for temp dir, default 0

  • fastp_options_list – A list of options for running fastp, default -a auto --qualified_quality_phred=15 --length_required=15

run_adapter_trimming(split_fastq=False, force_overwrite=True)

A method for running fastp adapter trimming

Parameters
  • split_fastq – Split fastq output files by line counts, default False

  • force_overwrite – A toggle for overwriting existing files, default True

Returns

A list of read1 files, a list of read2 files, an html report file and the fastp commandline

GATK utils

class igf_data.utils.tools.gatk_utils.GATK_tools(gatk_exe, ref_fasta, use_ephemeral_space=False, java_param='-XX:ParallelGCThreads=1 -Xmx4g')

A python class for running gatk tools

Parameters
  • gatk_exe – Gatk exe path

  • java_param – Java parameter, default ‘-XX:ParallelGCThreads=1 -Xmx4g’

  • ref_fasta – Input reference fasta filepath

  • use_ephemeral_space – A toggle for temp dir settings, default False

run_AnalyzeCovariates(before_report_file, after_report_file, output_pdf_path, force=False, dry_run=False, gatk_param_list=None)

A method for running GATK AnalyzeCovariates tool

Parameters
  • before_report_file – A file containing bqsr output before recalibration

  • after_report_file – A file containing bqsr output after recalibration

  • output_pdf_path – An output pdf filepath

  • force – Overwrite output file, if force is True

  • dry_run – Return GATK command, if its true, default False

  • gatk_param_list – List of additional params for BQSR, default None

Returns

GATK commandline

run_ApplyBQSR(bqsr_recal_file, input_bam, output_bam_path, force=False, dry_run=False, gatk_param_list=None)

A method for running GATK ApplyBQSR

Parameters
  • input_bam – An input bam file

  • bqsr_recal_file – An bqsr table filepath

  • output_bam_path – A bam output file

  • force – Overwrite output file, if force is True

  • dry_run – Return GATK command, if its true, default False

  • gatk_param_list – List of additional params for BQSR, default None

Returns

GATK commandline

run_BaseRecalibrator(input_bam, output_table, known_snp_sites=None, known_indel_sites=None, force=False, dry_run=False, gatk_param_list=None)

A method for running GATK BaseRecalibrator

Parameters
  • input_bam – An input bam file

  • output_table – An output table filepath for recalibration results

  • known_snp_sites – Known snp sites (e.g. dbSNP vcf file), default None

  • known_indel_sites – Known indel sites (e.g.Mills_and_1000G_gold_standard indels vcf), default None

  • force – Overwrite output file, if force is True

  • dry_run – Return GATK command, if its true, default False

  • gatk_param_list – List of additional params for BQSR, default None

Returns

GATK commandline

run_HaplotypeCaller(input_bam, output_vcf_path, dbsnp_vcf, emit_gvcf=True, force=False, dry_run=False, gatk_param_list=None)

A method for running GATK HaplotypeCaller

Parameters
  • input_bam – A input bam file

  • output_vcf_path – A output vcf filepath

  • dbsnp_vcf – A dbsnp vcf file

  • emit_gvcf – A toggle for GVCF generation, default True

  • force – Overwrite output file, if force is True

  • dry_run – Return GATK command, if its true, default False

  • gatk_param_list – List of additional params for BQSR, default None

Returns

GATK commandline

RSEM utils

class igf_data.utils.tools.rsem_utils.RSEM_utils(rsem_exe_dir, reference_rsem, input_bam, threads=1, memory_limit=4000, use_ephemeral_space=0)

A python wrapper for running RSEM tool

Parameters
  • rsem_exe_dir – RSEM executable path

  • reference_rsem – RSEM reference transcriptome path

  • input_bam – Input bam file path for RSEM

  • threads – No. of threads for RSEM run, default 1

  • memory_limit – Memory usage limit for RSEM, default 4Gb

  • use_ephemeral_space – A toggle for temp dir settings, default 0

run_rsem_calculate_expression(output_dir, output_prefix, paired_end=True, strandedness='reverse', options=None, force=True)

A method for running RSEM rsem-calculate-expression tool from alignment file

Parameters
  • output_dir – A output dir path

  • output_prefix – A output file prefix

  • paired_end – A toggle for paired end data, default True

  • strandedness – RNA strand information, default reverse for Illumina TruSeq allowed values are none, forward and reverse

  • options – A dictionary for rsem run, default None

  • force – Overwrite existing data if force is True, default True

Returns

RSEM commandline, output file list and logfile

Samtools utils

igf_data.utils.tools.samtools_utils.convert_bam_to_cram(samtools_exe, bam_file, reference_file, cram_path, threads=1, force=False, dry_run=False, use_ephemeral_space=0)

A function for converting bam files to cram using pysam utility

Parameters
  • samtools_exe – samtools executable path

  • bam_file – A bam filepath with / without index. Index file will be created if it's missing

  • reference_file – Reference genome fasta filepath

  • cram_path – A cram output file path

  • threads – Number of threads to use for conversion, default 1

  • force – Output cram will be overwritten if force is True, default False

  • dry_run – A toggle for returning the samtools command without actually running it, default False

  • use_ephemeral_space – A toggle for temp dir settings, default 0

Returns

None

Raises
  • IOError – It raises IOError if no input or reference fasta file found or output file already present and force is not True

  • ValueError – It raises ValueError if bam_file doesn’t have .bam extension or cram_path doesn’t have .cram extension

igf_data.utils.tools.samtools_utils.filter_bam_file(samtools_exe, input_bam, output_bam, samFlagInclude=None, reference_file=None, samFlagExclude=None, threads=1, mapq_threshold=20, cram_out=False, index_output=True, dry_run=False)

A function for filtering bam file using samtools view

Parameters
  • samtools_exe – Samtools path

  • input_bam – Input bamfile path

  • output_bam – Output bamfile path

  • samFlagInclude – Sam flags to keep, default None

  • reference_file – Reference genome fasta filepath

  • samFlagExclude – Sam flags to exclude, default None

  • threads – Number of threads to use, default 1

  • mapq_threshold – Skip alignments with MAPQ smaller than this value, default 20

  • index_output – Index output bam, default True

  • cram_out – Output cram file, default False

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

Samtools command

igf_data.utils.tools.samtools_utils.index_bam_or_cram(samtools_exe, input_path, threads=1, dry_run=False)

A method for running samtools index

Parameters
  • samtools_exe – samtools executable path

  • input_path – Alignment filepath

  • threads – Number of threads to use for conversion, default 1

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

samtools cmd list
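The cmd list such a dry run could return can be sketched as below; the helper name is hypothetical, and `-@` is the standard samtools thread option:

```python
# A sketch of the samtools index command a dry run might assemble;
# build_samtools_index_cmd is an illustrative helper, not the IGF API
def build_samtools_index_cmd(samtools_exe, input_path, threads=1):
    # `samtools index -@ N <alignment>` works for both bam and cram input
    return [samtools_exe, 'index', '-@', str(threads), input_path]
```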

igf_data.utils.tools.samtools_utils.merge_multiple_bam(samtools_exe, input_bam_list, output_bam_path, sorted_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, index_output=True)

A function for merging multiple input bams to a single output bam

Parameters
  • samtools_exe – samtools executable path

  • input_bam_list – A file containing a list of bam filepaths

  • output_bam_path – A bam output filepath

  • sorted_by_name – Sort bam file by read_name, default False (for coordinate sorted bams)

  • threads – Number of threads to use for merging, default 1

  • force – Output bam file will be overwritten if force is True, default False

  • index_output – Index output bam, default True

  • use_ephemeral_space – A toggle for temp dir settings, default 0

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

samtools command

igf_data.utils.tools.samtools_utils.run_bam_flagstat(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False)

A method for generating bam flagstat output

Parameters
  • samtools_exe – samtools executable path

  • bam_file – A bam filepath with / without index. Index file will be created if it's missing

  • output_dir – Bam flagstat output directory path

  • output_prefix – Output file prefix, default None

  • threads – Number of threads to use for conversion, default 1

  • force – Output flagstat file will be overwritten if force is True, default False

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

Output file path and a list containing samtools command

igf_data.utils.tools.samtools_utils.run_bam_idxstat(samtools_exe, bam_file, output_dir, output_prefix=None, force=False, dry_run=False)

A function for running samtools index stats generation

Parameters
  • samtools_exe – samtools executable path

  • bam_file – A bam filepath with / without index. Index file will be created if it's missing

  • output_dir – Bam idxstats output directory path

  • output_prefix – Output file prefix, default None

  • force – Output idxstats file will be overwritten if force is True, default False

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

Output file path and a list containing samtools command

igf_data.utils.tools.samtools_utils.run_bam_stats(samtools_exe, bam_file, output_dir, threads=1, force=False, output_prefix=None, dry_run=False)

A method for generating samtools stats output

Parameters
  • samtools_exe – samtools executable path

  • bam_file – A bam filepath with / without index. Index file will be created if it's missing

  • output_dir – Bam stats output directory path

  • output_prefix – Output file prefix, default None

  • threads – Number of threads to use for conversion, default 1

  • force – Output flagstat file will be overwritten if force is True, default False

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

Output file path, a list containing the samtools command and a list containing the SN metrics of the report

igf_data.utils.tools.samtools_utils.run_samtools_view(samtools_exe, input_file, output_file, reference_file=None, force=True, cram_out=False, threads=1, samtools_params=None, index_output=True, dry_run=False, use_ephemeral_space=0)

A function for running samtools view command

Parameters
  • samtools_exe – samtools executable path

  • input_file – An input bam filepath with / without index. Index file will be created if it's missing

  • output_file – An output file path

  • reference_file – Reference genome fasta filepath, default None

  • force – Output file will be overwritten if force is True, default True

  • threads – Number of threads to use for conversion, default 1

  • samtools_params – List of samtools params, default None

  • index_output – Index output file, default True

  • dry_run – A toggle for returning the samtools command without actually running it, default False

  • use_ephemeral_space – A toggle for temp dir settings, default 0

Returns

Samtools command as list

igf_data.utils.tools.samtools_utils.run_sort_bam(samtools_exe, input_bam_path, output_bam_path, sort_by_name=False, use_ephemeral_space=0, threads=1, force=False, dry_run=False, cram_out=False, index_output=True)

A function for sorting input bam file and generate a output bam

Parameters
  • samtools_exe – samtools executable path

  • input_bam_path – A bam filepath

  • output_bam_path – A bam output filepath

  • sort_by_name – Sort bam file by read_name, default False (for coordinate sorting)

  • threads – Number of threads to use for sorting, default 1

  • force – Output bam file will be overwritten if force is True, default False

  • cram_out – Output cram file, default False

  • index_output – Index output bam, default True

  • use_ephemeral_space – A toggle for temp dir settings, default 0

  • dry_run – A toggle for returning the samtools command without actually running it, default False

Returns

None
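
Since these helpers can return the assembled samtools command without executing it (dry_run=True), the command construction is easy to inspect. A minimal sketch, assuming a plain list-of-strings layout (build_sort_cmd is a hypothetical illustration, not a library function):

```python
# Hedged sketch of the kind of samtools sort command list that
# run_sort_bam assembles; flags and ordering are illustrative only.
def build_sort_cmd(samtools_exe, input_bam, output_bam, threads=1, sort_by_name=False):
    cmd = [samtools_exe, 'sort', '-o', output_bam, '-@', str(threads)]
    if sort_by_name:
        cmd.append('-n')  # samtools sort -n sorts by read name, not coordinate
    cmd.append(input_bam)
    return cmd

print(build_sort_cmd('samtools', 'input.bam', 'sorted.bam', threads=4))
```

The same list can then be handed to a process runner, which is why the dry_run toggle is useful for debugging pipeline steps.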

STAR utils

class igf_data.utils.tools.star_utils.Star_utils(star_exe, input_files, genome_dir, reference_gtf, output_dir, output_prefix, threads=1, use_ephemeral_space=0)

A wrapper python class for running STAR alignment

Parameters
  • star_exe – STAR executable path

  • input_files – List of input files for running alignment

  • genome_dir – STAR reference transcriptome path

  • reference_gtf – Reference GTF file for gene annotation

  • output_dir – Path for output alignment and results

  • output_prefix – File output prefix

  • threads – No. of threads for STAR run, default 1

  • use_ephemeral_space – A toggle for temp dir settings, default 0

generate_aligned_bams(two_pass_mode=True, dry_run=False, star_patameters=('--outFilterMultimapNmax', 20, '--alignSJoverhangMin', 8, '--alignSJDBoverhangMin', 1, '--outFilterMismatchNmax', 999, '--outFilterMismatchNoverReadLmax', 0.04, '--alignIntronMin', 20, '--alignIntronMax', 1000000, '--alignMatesGapMax', 1000000, '--limitBAMsortRAM', 12000000000))

A method running star alignment

Parameters
  • two_pass_mode – Run two-pass mode of star, default True

  • dry_run – A toggle for returning the STAR command without actually running it, default False

  • star_patameters – A list of STAR parameters, default is the ENCODE recommended set

Returns

A genomic bam, a transcriptomic bam, a log file, a gene count file and the STAR command line

generate_rna_bigwig(bedGraphToBigWig_path, chrom_length_file, stranded=True, dry_run=False)

A method for generating bigWig signal tracks from STAR aligned bam files

Parameters
  • bedGraphToBigWig_path – bedGraphToBigWig_path executable path

  • chrom_length_file – A file containing chromosome length, e.g. .fai file

  • stranded – Param for stranded analysis, default True

  • dry_run – A toggle for returning the STAR command without actually running it, default False

Returns

A list of bigWig files and the STAR command line
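
The default star_patameters value shown in generate_aligned_bams is a flat tuple of flag/value pairs. A minimal sketch of how such a tuple maps onto a STAR command line (the class itself may assemble it differently; only a subset of the defaults is shown):

```python
# Hedged sketch: flattening the flag/value tuple from generate_aligned_bams
# into command-line tokens for STAR.
star_params = ('--outFilterMultimapNmax', 20,
               '--alignSJoverhangMin', 8,
               '--limitBAMsortRAM', 12000000000)
star_cmd = ['STAR', '--runThreadN', '4'] + [str(p) for p in star_params]
print(star_cmd)
```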

Subread utils

igf_data.utils.tools.subread_utils.run_featureCounts(featurecounts_exe, input_gtf, input_bams, output_file, thread=1, use_ephemeral_space=0, options=None)

A wrapper method for running featureCounts tool from subread package

Parameters
  • featurecounts_exe – Path of featureCounts executable

  • input_gtf – Input gtf file path

  • input_bams – Input bam filepaths

  • output_file – Output filepath

  • thread – Thread counts, default is 1

  • options – featureCounts options, default is None

  • use_ephemeral_space – A toggle for temp dir settings, default 0

Returns

A summary file path and featureCounts command
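
A typical featureCounts invocation that such a wrapper builds can be sketched as follows (-a, -o and -T are standard featureCounts options; build_featurecounts_cmd is a hypothetical helper, and the wrapper's exact assembly is an assumption):

```python
# Hedged sketch of a featureCounts command line for one or more bam files:
# -a is the annotation gtf, -o the output counts file, -T the thread count.
def build_featurecounts_cmd(exe, input_gtf, input_bams, output_file, thread=1, options=None):
    cmd = [exe, '-a', input_gtf, '-o', output_file, '-T', str(thread)]
    cmd += list(options or [])
    cmd += list(input_bams)
    return cmd

print(build_featurecounts_cmd('featureCounts', 'genes.gtf',
                              ['s1.bam', 's2.bam'], 'counts.txt', thread=4))
```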

Reference genome fetch utils

class igf_data.utils.tools.reference_genome_utils.Reference_genome_utils(genome_tag, dbsession_class, genome_fasta_type='GENOME_FASTA', fasta_fai_type='GENOME_FAI', genome_dict_type='GENOME_DICT', gene_gtf_type='GENE_GTF', gene_reflat_type='GENE_REFFLAT', gene_rsem_type='TRANSCRIPTOME_RSEM', bwa_ref_type='GENOME_BWA', minimap2_ref_type='GENOME_MINIMAP2', bowtie2_ref_type='GENOME_BOWTIE2', tenx_ref_type='TRANSCRIPTOME_TENX', star_ref_type='TRANSCRIPTOME_STAR', genome_dbsnp_type='DBSNP_VCF', gatk_snp_ref_type='GATK_SNP_REF', gatk_indel_ref_type='INDEL_LIST_VCF', ribosomal_interval_type='RIBOSOMAL_INTERVAL', blacklist_interval_type='BLACKLIST_BED', genome_twobit_uri_type='GENOME_TWOBIT_URI')

A class for accessing different components of the reference genome for a specific build

get_blacklist_region_bed(check_missing=False)

A method for fetching blacklist interval filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default False

Returns

A filepath string

get_dbsnp_vcf(check_missing=True)

A method for fetching filepath for dbSNP vcf file, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_gatk_indel_ref(check_missing=True)

A method for fetching filepaths for INDEL files from GATK bundle, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A list of filepaths

get_gatk_snp_ref(check_missing=True)

A method for fetching filepaths for SNP files from GATK bundle, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A list of filepaths

get_gene_gtf(check_missing=True)

A method for fetching reference gene annotation gtf filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_gene_reflat(check_missing=True)

A method for fetching reference gene annotation refflat filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_generic_ref_files(collection_type, check_missing=True)

A method for fetching filepath for generic reference genome file, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string or list (if more than one found)
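
The return convention above (a single string when one file matches, a list when several do) can be sketched as follows (normalise_ref_paths is a hypothetical illustration, not a library function):

```python
def normalise_ref_paths(paths):
    # Mirror the documented convention: one match -> plain filepath string,
    # several matches -> list of filepaths.
    paths = list(paths)
    return paths[0] if len(paths) == 1 else paths
```

Callers therefore need to handle both shapes, e.g. for GATK bundle lookups that return lists.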

get_genome_bowtie2(check_missing=True)

A method for fetching filepath of Bowtie2 reference index, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_genome_bwa(check_missing=True)

A method for fetching filepath of BWA reference index, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_genome_dict(check_missing=True)

A method for fetching reference genome dictionary filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_genome_fasta(check_missing=True)

A method for fetching reference genome fasta filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_genome_fasta_fai(check_missing=True)

A method for fetching reference genome fasta fai index filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_genome_minimap2(check_missing=True)

A method for fetching filepath of Minimap2 reference index, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_ribosomal_interval(check_missing=True)

A method for fetching ribosomal interval filepath for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_transcriptome_rsem(check_missing=False)

A method for fetching filepath of RSEM reference transcriptome, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default False

Returns

A filepath string

get_transcriptome_star(check_missing=False)

A method for fetching filepath of STAR reference transcriptome, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default False

Returns

A filepath string

get_transcriptome_tenx(check_missing=True)

A method for fetching filepath of 10X Cellranger reference transcriptome, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A filepath string

get_twobit_genome_url(check_missing=True)

A method for fetching filepath for twobit genome url, for a specific genome build

Parameters

check_missing – A toggle for checking errors for missing files, default True

Returns

A url string


Scanpy utils

Metadata processing

Register metadata for new projects

class igf_data.process.seqrun_processing.find_and_register_new_project_data.Find_and_register_new_project_data(projet_info_path, dbconfig, user_account_template, log_slack=True, slack_config=None, check_hpc_user=False, hpc_user=None, hpc_address=None, ldap_server=None, setup_irods=True, notify_user=True, default_user_email='igf@imperial.ac.uk', project_lookup_column='project_igf_id', user_lookup_column='email_id', data_authority_column='data_authority', sample_lookup_column='sample_igf_id', barcode_check_keyword='barcode_check', metadata_sheet_name='Project metadata', sendmail_exe='/usr/sbin/sendmail')

A class for finding new data for projects and registering them to the db. Accounts for new users will be created on the iRODS server and passwords will be mailed to them.

Parameters
  • projet_info_path – A directory path for project info files

  • dbconfig – A json dbconfig file

  • check_hpc_user – Guess the hpc user name, True or False, default: False

  • hpc_user – A hpc user name, default is None

  • hpc_address – A hpc host address, default is None

  • ldap_server – A ldap server address for search, default is None

  • user_account_template – A template file for user account activation email

  • log_slack – Enable or disable sending message to slack, default: True

  • slack_config – A slack config json file, required if log_slack is True

  • project_lookup_column – project data lookup column, default project_igf_id

  • user_lookup_column – user data lookup column, default email_id

  • sample_lookup_column – sample data lookup column, default sample_igf_id

  • data_authority_column – data authority column name, default data_authority

  • setup_irods – Setup irods account for user, default is True

  • notify_user – Send email notification to user, default is True

  • default_user_email – Add another user as the default collaborator for all new projects, default igf@imperial.ac.uk

  • barcode_check_keyword – Project attribute name for barcode check settings, default barcode_check

  • sendmail_exe – Sendmail executable path, default /usr/sbin/sendmail

process_project_data_and_account()

A method for finding new project info, registering it to the database and creating user accounts

Update experiment metadata from sample attributes

class igf_data.process.metadata.experiment_metadata_updator.Experiment_metadata_updator(dbconfig_file, log_slack=True, slack_config=None)

A class for updating metadata for experiment table in database

update_metadta_from_sample_attribute(experiment_igf_id=None, sample_attribute_names=('library_source', 'library_strategy', 'experiment_type'))

A method for fetching experiment metadata from sample_attribute tables

Parameters
  • experiment_igf_id – An experiment igf id for updating only a selected experiment, default None for all experiments

  • sample_attribute_names – A list of sample attribute names to look for experiment metadata, default: library_source, library_strategy, experiment_type

Sequencing run

Process samplesheet file

class igf_data.illumina.samplesheet.SampleSheet(infile, data_header_name='Data')

A class for processing SampleSheet files for Illumina sequencing runs

Parameters
  • infile – A samplesheet file

  • data_header_name – name of the data section, default Data

add_pseudo_lane_for_miseq(lane='1')

A method for adding pseudo lane information for the miseq platform

Parameters

lane – A lane id for pseudo lane value

add_pseudo_lane_for_nextseq(lanes=('1', '2', '3', '4'))

A method for adding pseudo lane information for the nextseq platform

Parameters

lanes – A list of pseudo lanes, default [‘1’,’2’,’3’,’4’]

Returns

None

check_sample_header(section, condition_key)

Function for checking SampleSheet header

Parameters
  • section – A field name for header info check

  • condition_key – A condition key for header info check

Returns

Zero if the term is not present, or the number of occurrences of the term

filter_sample_data(condition_key, condition_value, method='include', lane_header='Lane', lane_default_val='1')

Function for filtering SampleSheet data based on matching condition

Parameters
  • condition_key – A samplesheet column name

  • condition_value – A keyword present in the selected column

  • method – ‘include’ or ‘exclude’ for keeping or removing the selected data from the samplesheet, default is ‘include’

get_index_count()

A function for getting index length counts

Returns

A dictionary, with the index columns as the key

get_indexes()

A method for retrieving the indexes from the samplesheet

Returns

A list of index barcodes

get_lane_count(lane_field='Lane', target_platform='HiSeq')

Function for getting the lane information for HiSeq runs. It will return 1 for both MiSeq and NextSeq runs

Parameters
  • lane_field – Column name for lane info, default ‘Lane’

  • target_platform – Hiseq platform tag, default ‘HiSeq’

Returns

A list of lanes present in samplesheet file

get_platform_name(section='Header', field='Application')

Function for getting platform details from samplesheet header

Parameters
  • section – File section for lookup, default ‘Header’

  • field – Field name for platform info, default ‘Application’

get_project_and_lane(project_tag='Sample_Project', lane_tag='Lane')

A method for fetching project and lane information from samplesheet

Parameters
  • project_tag – A string for project name column in the samplesheet, default Sample_Project

  • lane_tag – A string for Lane id column in the samplesheet, default Lane

Returns

A list of project name (for all) and lane information (only for hiseq)

get_project_names(tag='sample_project')

Function for retrieving unique project names from samplesheet. If there are multiple matching headers, the first column will be used

Parameters

tag – Name of tag for project lookup, default sample_project

Returns

A list of unique project name

get_reverse_complement_index(index_field='index2')

A function for changing the I5_index present in the index2 field of the samplesheet to its reverse complement

Parameters

index_field – Column name for index 2, default index2
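
The reverse-complement conversion applied to each I5 barcode can be sketched as (a minimal illustration, not the library's implementation):

```python
# Hedged sketch of reverse-complementing an I5 index barcode; N bases map to N.
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}

def reverse_complement(index_seq):
    return ''.join(COMPLEMENT[base] for base in reversed(index_seq.upper()))
```

This conversion is needed because some Illumina platforms read the I5 index in the opposite direction.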

group_data_by_index_length()

Function for grouping samplesheet rows based on the combined length of the index columns. By default, this function removes Ns from the index

Returns

A dictionary of samplesheet objects, with combined index length as the key
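
The grouping logic described above can be sketched as follows (plain dicts stand in for samplesheet rows here; the real method returns samplesheet objects):

```python
# Hedged sketch: group rows by combined index length after stripping Ns.
rows = [{'index': 'ATGCATGC', 'index2': 'NNNNNNNN'},
        {'index': 'ATGCAT', 'index2': ''}]
groups = {}
for row in rows:
    combined = (row['index'] + row.get('index2', '')).replace('N', '')
    groups.setdefault(len(combined), []).append(row)
print(sorted(groups))
```

Grouping by effective index length matters because each bcl2fastq run handles a single bases-mask configuration.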

modify_sample_header(section, type, condition_key, condition_value='')

Function for modifying SampleSheet header

Parameters
  • section – A field name for header info check

  • condition_key – A condition key for header info check

  • type – Mode type, ‘add’ or ‘remove’

  • condition_value – It is required for the ‘add’ type

print_sampleSheet(outfile)

Function for printing output SampleSheet

Parameters

outfile – An output samplesheet path

validate_samplesheet_data(schema_json)

A method for validation of samplesheet data

Parameters

schema_json – A JSON schema for validation of the samplesheet data

Returns

A list of error messages, or an empty list if no error is found

Fetch read cycle info from RunInfo.xml file

class igf_data.illumina.runinfo_xml.RunInfo_xml(xml_file)

A class for reading runinfo xml file from illumina sequencing runs

Parameters

xml_file – A runinfo xml file

get_flowcell_name()

A method for accessing the flowcell name from the runinfo xml file

get_platform_number()

Function for fetching the instrument series number

get_reads_stats(root_tag='read', number_tag='number', tags=('isindexedread', 'numcycles'))

A method for getting read and index stats from the RunInfo.xml file

Parameters
  • root_tag – Root tag for xml file, default read

  • number_tag – Number tag for xml file, default number

  • tags – List of tags for xml lookup, default [‘isindexedread’,’numcycles’]

Returns

A dictionary with the read number as the key
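
RunInfo.xml stores the per-read cycle counts as attributes of Read elements; a minimal parse of the relevant tags can be sketched as follows (the xml fragment is a trimmed, hypothetical example):

```python
import xml.etree.ElementTree as ET

# Hedged sketch: extract numcycles / isindexedread per read from a
# RunInfo.xml-style fragment, keyed by read number as in get_reads_stats.
RUNINFO_FRAGMENT = '''<RunInfo><Run>
  <Reads>
    <Read Number="1" NumCycles="151" IsIndexedRead="N"/>
    <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
  </Reads>
</Run></RunInfo>'''

root = ET.fromstring(RUNINFO_FRAGMENT)
reads_stats = {
    read.get('Number'): {'numcycles': read.get('NumCycles'),
                         'isindexedread': read.get('IsIndexedRead')}
    for read in root.iter('Read')}
print(reads_stats)
```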

Fetch flowcell info from runparameters xml file

class igf_data.illumina.runparameters_xml.RunParameter_xml(xml_file)

A class for reading runparameters xml file from Illumina sequencing runs

Parameters

xml_file – A runparameters xml file

get_hiseq_flowcell()

A method for fetching flowcell details for hiseq run

Returns

Flowcell info or None (for MiSeq and NextSeq runs)

Find and process new sequencing run for demultiplexing

igf_data.process.seqrun_processing.find_and_process_new_seqrun.calculate_file_md5(seqrun_info, md5_out, seqrun_path, file_suffix='md5.json', exclude_dir=())

A method for file md5 calculation for all the sequencing run files

Parameters
  • seqrun_info – A dictionary containing sequencing run information

  • md5_out – JSON md5 file output directory

  • file_suffix – Suffix information for new JSON md5 files, default: md5.json

  • exclude_dir – A list of directories to exclude from the file look up

Returns

Output is a dictionary of json files

{seqrun_name: seqrun_md5_list_path}

Format of the json file: [{"seqrun_file_name": "file_path", "file_md5": "md5_value"}]
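
The per-file md5 values recorded in those JSON files can be computed with a chunked read, e.g. (a generic sketch, not the library's implementation):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    # Hedged sketch: stream the file in 1 MB chunks so large run files
    # do not need to fit in memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as handle:
        for block in iter(lambda: handle.read(chunk_size), b''):
            md5.update(block)
    return md5.hexdigest()
```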

igf_data.process.seqrun_processing.find_and_process_new_seqrun.check_finished_seqrun_dir(seqrun_dir, seqrun_path, required_files=('RTAComplete.txt', 'SampleSheet.csv', 'RunInfo.xml'))

A method for checking complete sequencing run directory

Parameters
  • seqrun_dir – A list of sequencing run names

  • seqrun_path – A directory path for new sequencing run look up

  • required_files – A list of files to check before marking sequencing run as complete, default: ‘RTAComplete.txt’,’SampleSheet.csv’,’RunInfo.xml’

Returns

A dictionary containing valid sequencing run information

igf_data.process.seqrun_processing.find_and_process_new_seqrun.check_for_registered_project_and_sample(seqrun_info, dbconfig, samplesheet_file='SampleSheet.csv')

A method for fetching project and sample records from samplesheet and checking for registered samples in db

Parameters
  • seqrun_info – A dictionary containing seqrun name and path as key and values

  • dbconfig – A database configuration file

  • samplesheet_file – Name of samplesheet file, default is SampleSheet.csv

Returns

A dictionary containing the new run information, and a string message containing database checking information

igf_data.process.seqrun_processing.find_and_process_new_seqrun.check_seqrun_dir_in_db(all_seqrun_dir, dbconfig)

A method for checking existing seqrun dirs in database

Parameters
  • all_seqrun_dir – list of seqrun dirs to check

  • dbconfig – dbconfig

Returns

A list containing new sequencing run information

igf_data.process.seqrun_processing.find_and_process_new_seqrun.find_new_seqrun_dir(path, dbconfig)

A method for checking and finding new sequencing run directories

Parameters
  • path – A directory path for new sequencing run lookup

  • dbconfig – A database configuration file

Returns

A list of new sequencing run names for processing

igf_data.process.seqrun_processing.find_and_process_new_seqrun.load_seqrun_files_to_db(seqrun_info, seqrun_md5_info, dbconfig, file_type='ILLUMINA_BCL_MD5')

A method for loading md5 lists to collection and files table

Parameters
  • seqrun_info – A dictionary containing the sequencing run information

  • seqrun_md5_info – A dictionary containing the sequencing run JSON md5 file info

  • dbconfig – A database configuration file

  • file_type – A collection type information for loading the JSON files to database

Returns

None

igf_data.process.seqrun_processing.find_and_process_new_seqrun.prepare_seqrun_for_db(seqrun_name, seqrun_path, session_class)

A method for preparing seqrun data for database

Parameters
  • seqrun_name – A sequencing run name

  • seqrun_path – A directory path for sequencing run look up

  • session_class – A database session class

Returns

A dictionary containing information to populate the seqrun table in database

igf_data.process.seqrun_processing.find_and_process_new_seqrun.seed_pipeline_table_for_new_seqrun(pipeline_name, dbconfig)

A method for seeding pipelines for the new seqruns

Parameters
  • pipeline_name – A pipeline name

  • dbconfig – A dbconfig file

Returns

None

igf_data.process.seqrun_processing.find_and_process_new_seqrun.validate_samplesheet_for_seqrun(seqrun_info, schema_json, output_dir, samplesheet_file='SampleSheet.csv')

A method for validating samplesheet and writing errors to a report file

Parameters
  • seqrun_info – A dictionary containing seqrun name and path as key and values

  • schema_json – A json schema for samplesheet validation

  • output_dir – A directory path for writing output report files

  • samplesheet_file – Samplesheet filename, default ‘SampleSheet.csv’

Returns

new_seqrun_info, A new dictionary containing seqrun name and path as key and values

Returns

error_file_list, A dictionary containing seqrun name and error file paths as key and values

Demultiplexing

Bases mask calculation

class igf_data.illumina.basesMask.BasesMask(samplesheet_file, runinfo_file, read_offset, index_offset)

A class for bases mask value calculation for demultiplexing of sequencing runs

Parameters
  • samplesheet_file – A samplesheet file containing sample index barcodes

  • runinfo_file – A runinfo xml file from sequencing run

  • read_offset – Read offset value in bp

  • index_offset – Index offset value in bp

calculate_bases_mask(numcycle_label='numcycles', isindexedread_label='isindexedread')

A method for bases mask value calculation

Parameters
  • numcycle_label – Cycle label in runinfo xml file, default numcycles

  • isindexedread_label – Index cycle label in runinfo xml file, default isindexedread

Returns

A formatted bases mask value for bcl2fastq run
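
A bases mask pairs each read segment with y (data cycles) or i (index cycles). Given the cycle counts from RunInfo.xml, its assembly can be sketched as follows (the segment values are illustrative, and the class applies read/index offsets before formatting):

```python
# Hedged sketch: assemble a bcl2fastq bases mask string from
# (is_index, numcycles) pairs taken from RunInfo.xml.
segments = [(False, 151), (True, 8), (True, 8), (False, 151)]
bases_mask = ','.join(
    ('i{}' if is_index else 'y{}').format(cycles)
    for is_index, cycles in segments)
print(bases_mask)
```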

Copy bcl files for demultiplexing

Collect demultiplexed fastq files to database

class igf_data.process.seqrun_processing.collect_seqrun_fastq_to_db.Collect_seqrun_fastq_to_db(fastq_dir, model_name, seqrun_igf_id, session_class, flowcell_id, samplesheet_file=None, samplesheet_filename='SampleSheet.csv', collection_type='demultiplexed_fastq', file_location='HPC_PROJECT', collection_table='run', manifest_name='file_manifest.csv', singlecell_tag='10X')

A class for collecting raw fastq files after demultiplexing and storing them in the database. Additionally, this will create relevant entries for the experiment and run tables in the database

Parameters
  • fastq_dir – A directory path for file look up

  • model_name – Sequencing platform information

  • seqrun_igf_id – Sequencing run name

  • session_class – A database session class

  • flowcell_id – Flowcell information for the run

  • samplesheet_file – Samplesheet filepath

  • samplesheet_filename – Name of the samplesheet file, default SampleSheet.csv

  • collection_type – Collection type information for new fastq files, default demultiplexed_fastq

  • file_location – Fastq file location information, default HPC_PROJECT

  • collection_table – Collection table information for fastq files, default run

  • manifest_name – Name of the file manifest file, default file_manifest.csv

  • singlecell_tag – Samplesheet description for singlecell samples, default 10X

find_fastq_and_build_db_collection()

A method for finding fastq files and samplesheet under a run directory and loading the new files to db with their experiment and run information

It calculates the following entries

  • library_name

    Same as sample_id unless mentioned in ‘Description’ field of samplesheet

  • experiment_igf_id

    library_name combined with the platform name; the same library sequenced on a different platform will be added as a separate experiment

  • run_igf_id

    experiment_igf_id combined with the sequencing flowcell_id and lane_id. Collection name: same as run_igf_id; fastq files will be added to the db collection using this id

  • collection type

    Default type for fastq file collections is ‘demultiplexed_fastq’

  • file_location

    Default value is ‘HPC_PROJECT’
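
The id derivation listed above can be sketched as simple string composition (the exact separators and placeholder values used here are assumptions, not the library's confirmed format):

```python
# Hedged sketch of the naming scheme described above; separators are assumed.
sample_id = 'IGF0001'
library_name = sample_id           # unless overridden via the Description field
platform_name = 'HISEQ4000'
flowcell_id = 'HXXXXXXXX'          # placeholder flowcell id
lane_id = '1'

experiment_igf_id = '{0}_{1}'.format(library_name, platform_name)
run_igf_id = '{0}_{1}_{2}'.format(experiment_igf_id, flowcell_id, lane_id)
print(run_igf_id)
```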

Check demultiplexing barcode stats

Pipeline control

Reset pipeline seeds for re-processing

class igf_data.process.pipeline.modify_pipeline_seed.Modify_pipeline_seed(igf_id_list, table_name, pipeline_name, dbconfig_file, log_slack=True, log_asana=True, slack_config=None, asana_project_id=None, asana_config=None, clean_up=True)

A class for changing pipeline run status in the pipeline_seed table

reset_pipeline_seed_for_rerun(seeded_label='SEEDED', restricted_status_list=('SEEDED', 'RUNNING'))

A method for setting the pipeline for re-run if the first run has failed or aborted. This method will set pipeline_seed.status to ‘SEEDED’ only if it is not already ‘SEEDED’ or ‘RUNNING’

Parameters
  • seeded_label – A text label for seeded status, default SEEDED

  • restricted_status_list – A list of pipeline statuses to exclude from the search, default [‘SEEDED’, ‘RUNNING’]
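The guard described above, re-seed only when the current status is outside the restricted list, can be sketched in isolation. The function name here is hypothetical and only mirrors the documented behaviour:

```python
def next_seed_status(current_status, seeded_label='SEEDED',
                     restricted_status_list=('SEEDED', 'RUNNING')):
    """Return the new pipeline_seed.status for a re-run, or None when
    the seed must be left untouched because it is already seeded or
    still running."""
    if current_status in restricted_status_list:
        return None
    return seeded_label
```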

Reset samplesheet files after modification for rerunning pipeline

class igf_data.process.seqrun_processing.reset_samplesheet_md5.Reset_samplesheet_md5(seqrun_path, seqrun_igf_list, dbconfig_file, clean_up=True, json_collection_type='ILLUMINA_BCL_MD5', log_slack=True, log_asana=True, slack_config=None, asana_project_id=None, asana_config=None, samplesheet_name='SampleSheet.csv')

A class for modifying samplesheet md5 for seqrun data processing

run()

A method for resetting md5 values in the samplesheet json files for all seqrun ids
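A sketch of the md5 reset for one samplesheet entry. The json keys (`file_name`, `file_md5`) are assumptions, shown only to illustrate recomputing the checksum after the samplesheet has been edited:

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Compute the md5 hex digest of a file's contents."""
    with open(path, 'rb') as fh:
        return hashlib.md5(fh.read()).hexdigest()


def reset_samplesheet_md5(json_entries, samplesheet_path,
                          samplesheet_name='SampleSheet.csv'):
    """Replace the stored md5 for the samplesheet entry so the
    modified file passes validation on the next pipeline run.
    The entry keys here are hypothetical."""
    for entry in json_entries:
        if entry.get('file_name') == samplesheet_name:
            entry['file_md5'] = file_md5(samplesheet_path)
    return json_entries


# write a tiny samplesheet and refresh its recorded md5
with tempfile.TemporaryDirectory() as tmp:
    sheet = os.path.join(tmp, 'SampleSheet.csv')
    with open(sheet, 'w') as fh:
        fh.write('[Header]\n')
    entries = [{'file_name': 'SampleSheet.csv', 'file_md5': 'stale'}]
    entries = reset_samplesheet_md5(entries, sheet)
```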

Demultiplexing of single cell sample

Modify samplesheet for singlecell samples

class igf_data.process.singlecell_seqrun.processsinglecellsamplesheet.ProcessSingleCellSamplesheet(samplesheet_file, singlecell_barcode_json, singlecell_tag='10X', index_column='index', sample_id_column='Sample_ID', sample_name_column='Sample_Name', orig_sample_id='Original_Sample_ID', orig_sample_name='Original_Sample_Name', sample_description_column='Description', orig_index='Original_index')

A class for processing a samplesheet containing single cell (10X) index barcodes. It requires a json format file listing all the single cell barcodes, downloaded from https://support.10xgenomics.com/single-cell-gene-expression/sequencing/doc/specifications-sample-index-sets-for-single-cell-3

Parameters
  • samplesheet_file – A samplesheet containing single cell samples

  • singlecell_barcode_json – A JSON file listing single cell indexes

  • singlecell_tag – A text keyword for the single cell sample description, default ‘10X’

  • index_column – Column name for index lookup, default ‘index’

  • sample_id_column – Column name for sample_id lookup, default ‘Sample_ID’

  • sample_name_column – Column name for sample_name lookup, default ‘Sample_Name’

  • orig_sample_id – Column name for keeping original sample ids, default ‘Original_Sample_ID’

  • orig_sample_name – Column name for keeping original sample_names, default ‘Original_Sample_Name’

  • orig_index – Column name for keeping original index, default ‘Original_index’

change_singlecell_barcodes(output_samplesheet)

A method for replacing single cell index codes present in the samplesheet with the four index sequences. This method will create 4 samplesheet entries for each of the single cell samples, with _1 to _4 suffixes and the relevant indexes

Parameters

output_samplesheet – A file name of the output samplesheet
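The expansion step can be sketched for one row: each single cell sample is replaced by four entries with `_1` to `_4` suffixes, one per index sequence from the barcode json. The barcode set and sequences below are illustrative only; the real values come from the 10X json file:

```python
def expand_singlecell_row(row, barcode_map):
    """Expand one samplesheet row whose 'index' field holds a 10X
    barcode set id into four rows, one per real index sequence,
    keeping the original values in Original_* columns."""
    expanded = []
    for i, index_seq in enumerate(barcode_map[row['index']], start=1):
        new_row = dict(row)
        new_row['Original_Sample_ID'] = row['Sample_ID']
        new_row['Original_Sample_Name'] = row['Sample_Name']
        new_row['Original_index'] = row['index']
        new_row['Sample_ID'] = '{0}_{1}'.format(row['Sample_ID'], i)
        new_row['Sample_Name'] = '{0}_{1}'.format(row['Sample_Name'], i)
        new_row['index'] = index_seq
        expanded.append(new_row)
    return expanded


# illustrative barcode set; real sequences come from the 10X json file
barcodes = {'SI-GA-A1': ['GGTTTACT', 'CTAAACGG', 'TCGGCGTC', 'AACCGTAA']}
rows = expand_singlecell_row(
    {'Sample_ID': 'IGF0001', 'Sample_Name': 'sampleA',
     'index': 'SI-GA-A1', 'Description': '10X'},
    barcodes)
```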

Merge fastq files for single cell samples

class igf_data.process.singlecell_seqrun.mergesinglecellfastq.MergeSingleCellFastq(fastq_dir, samplesheet, platform_name, singlecell_tag='10X', sampleid_col='Sample_ID', samplename_col='Sample_Name', use_ephemeral_space=0, orig_sampleid_col='Original_Sample_ID', description_col='Description', orig_samplename_col='Original_Sample_Name', project_col='Sample_Project', lane_col='Lane', pseudo_lane_col='PseudoLane', force_overwrite=True)

A class for merging single cell fastq files per lane per sample

Parameters
  • fastq_dir – A directory path containing fastq files

  • samplesheet – A samplesheet file used for demultiplexing of bcl files

  • platform_name – A sequencing platform name

  • singlecell_tag – A single cell keyword for description field, default ‘10X’

  • sampleid_col – A keyword for sample id column of samplesheet, default ‘Sample_ID’

  • samplename_col – A keyword for sample name column of samplesheet, default ‘Sample_Name’

  • orig_sampleid_col – A keyword for original sample id column, default ‘Original_Sample_ID’

  • orig_samplename_col – A keyword for original sample name column, default ‘Original_Sample_Name’

  • description_col – A keyword for description column, default ‘Description’

  • project_col – A keyword for project column, default ‘Sample_Project’

  • pseudo_lane_col – A keyword for pseudo lane column, default ‘PseudoLane’

  • lane_col – A keyword for lane column, default ‘Lane’

  • force_overwrite – A toggle for overwriting output fastqs, default True

The SampleSheet file should contain the following columns:
  • Sample_ID: A single cell sample id in the following format, SampleId_{digit}

  • Sample_Name: A single cell sample name in the following format, SampleName_{digit}

  • Original_Sample_ID: An IGF sample id

  • Original_Sample_Name: A sample name provided by user

  • Description: A single cell label, default 10X

merge_fastq_per_lane_per_sample()

A method for merging the single cell fastq files present in the input fastq_dir on a per lane, per sample basis
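Regrouping the split fastq fragments for merging can be sketched by parsing Illumina-style file names. The filename pattern below is an assumption for illustration, not the class's actual implementation:

```python
import re
from collections import defaultdict

# hypothetical bcl2fastq-style name:
# <sample>_<split>_S<n>_L<lane>_R<read>_001.fastq.gz
FASTQ_PATTERN = re.compile(
    r'^(?P<sample>.+)_(?P<split>[1-4])_S\d+_'
    r'(?P<lane>L\d{3})_(?P<read>R[12])_\d+\.fastq\.gz$')


def group_split_fastqs(filenames):
    """Group the _1 to _4 split fastq files of each single cell
    sample by (original sample, lane, read) so that each group can
    be concatenated into one merged fastq."""
    groups = defaultdict(list)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:
            key = (match.group('sample'), match.group('lane'),
                   match.group('read'))
            groups[key].append(name)
    return {key: sorted(names) for key, names in groups.items()}


merged = group_split_fastqs([
    'IGF0001_1_S1_L001_R1_001.fastq.gz',
    'IGF0001_2_S2_L001_R1_001.fastq.gz',
    'IGF0001_1_S1_L001_R2_001.fastq.gz',
    'IGF0001_2_S2_L001_R2_001.fastq.gz'])
```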

Report page building

Configure Biodalliance genome browser for qc page

class igf_data.utils.config_genome_browser.Config_genome_browser(dbsession_class, project_igf_id, collection_type_list, pipeline_name, collection_table, species_name, ref_genome_type, track_file_type=None, analysis_path_prefix='analysis', use_ephemeral_space=0, analysis_dir_structure_list=('sample_igf_id', ))

A class for configuring genome browser input files for analysis track visualization

Parameters
  • dbsession_class – A database session class

  • project_igf_id – A project igf id

  • collection_type_list – A list of collection types to include in the track

  • pipeline_name – Name of the analysis pipeline for status checking

  • collection_table – Name of the file collection table

  • species_name – Species name for ref genome fetching

  • ref_genome_type – Reference genome type for remote tracks

  • track_file_type – Additional track file collection types

  • analysis_path_prefix – Top level dir name for analysis files, default ‘analysis’

  • use_ephemeral_space – A toggle for temp dir settings, default 0

  • analysis_dir_structure_list – List of keywords for sub directory paths, default [‘sample_igf_id’]

build_biodalliance_config(template_file, output_file)

A method for building a Biodalliance specific config file

Parameters
  • template_file – A template file path

  • output_file – An output filepath
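The template-plus-tracks idea can be sketched with a much-reduced, hypothetical config template; the real Biodalliance template file contains far more settings, and the field names here are assumptions:

```python
import json
from string import Template

# hypothetical, much-reduced browser config template
CONFIG_TEMPLATE = Template('{"species": "$species", "sources": $sources}')


def render_browser_config(species_name, track_files):
    """Fill the template with one source entry per analysis track."""
    sources = [{'name': track['name'], 'uri': track['path']}
               for track in track_files]
    return CONFIG_TEMPLATE.substitute(
        species=species_name, sources=json.dumps(sources))


config = render_browser_config(
    'HG38', [{'name': 'sampleA cram', 'path': 'analysis/sampleA.cram'}])
```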

Process Google chart json data

igf_data.utils.gviz_utils.convert_to_gviz_json_for_display(description, data, columns_order, output_file=None)

A utility method for writing gviz format json file for data display using Google charts

Parameters
  • description – A dictionary for the data table description

  • data – A dictionary containing the data table

  • columns_order – A tuple of the data table column order

  • output_file – Output filename, default None

Returns

None if output_file name is present, or else the json_data string
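The gviz structure being written can be sketched without the helper: the description maps each column name to a (type, label) pair, and the output follows the Google Charts DataTable json layout. This stand-in is a simplification of what the utility produces:

```python
import json


def to_gviz_json(description, data, columns_order):
    """Build a Google Charts DataTable style json string from a
    column description {name: (type, label)} and row dictionaries."""
    cols = [{'id': name,
             'label': description[name][1],
             'type': description[name][0]}
            for name in columns_order]
    rows = [{'c': [{'v': row[name]} for name in columns_order]}
            for row in data]
    return json.dumps({'cols': cols, 'rows': rows})


gviz_json = to_gviz_json(
    description={'sample': ('string', 'Sample'),
                 'reads': ('number', 'Read count')},
    data=[{'sample': 'IGF0001', 'reads': 1000000}],
    columns_order=('sample', 'reads'))
```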

Generate data for QC project page

igf_data.utils.project_data_display_utils.add_seqrun_path_info(input_data, output_file, seqrun_col='seqrun_igf_id', flowcell_col='flowcell_id', path_col='path')

A utility method for adding a remote path to a dataframe for each sequencing run of a project

Parameters
  • input_data – An input dataframe containing the columns seqrun_igf_id and flowcell_id

  • output_file – An output filepath for the json data

  • seqrun_col – Column name for sequencing run id, default seqrun_igf_id

  • flowcell_col – Column name for flowcell id, default flowcell_id

  • path_col – Column name for path, default path

igf_data.utils.project_data_display_utils.convert_project_data_gviz_data(input_data, sample_col='sample_igf_id', read_count_col='attribute_value', seqrun_col='flowcell_id')

A utility method for converting project’s data availability information to gviz data table format https://developers.google.com/chart/interactive/docs/reference#DataTable

Parameters
  • input_data – A pandas data frame; it should contain the columns sample_igf_id, flowcell_id and attribute_value (R1_READ_COUNT)

  • sample_col – Column name for sample id, default sample_igf_id

  • seqrun_col – Column name for sequencing run identifier, default flowcell_id

  • read_count_col – Column name for sample read counts, default attribute_value

Returns

A dictionary of the description, a list of data dictionaries and a tuple of the column order

Generate data for QC status page

class igf_data.utils.project_status_utils.Project_status(igf_session_class, project_igf_id, seqrun_work_day=2, analysis_work_day=1, sequencing_resource_name='Sequencing', demultiplexing_resource_name='Demultiplexing', analysis_resource_name='Primary Analysis', task_id_label='task_id', task_name_label='task_name', resource_label='resource', dependencies_label='dependencies', start_date_label='start_date', end_date_label='end_date', duration_label='duration', percent_complete_label='percent_complete')

A class for fetching project status and generating a gviz json file for a Google chart Gantt plot

Parameters
  • igf_session_class – Database session class

  • project_igf_id – Project igf id for database lookup

  • seqrun_work_day – Duration for seqrun jobs in days, default 2

  • analysis_work_day – Duration for analysis jobs in days, default 1

  • sequencing_resource_name – Resource name for sequencing data, default Sequencing

  • demultiplexing_resource_name – Resource name for demultiplexing data, default Demultiplexing

  • analysis_resource_name – Resource name for analysis data, default Primary Analysis

  • task_id_label – Label for task id field, default task_id

  • task_name_label – Label for task name field, default task_name

  • resource_label – Label for resource field, default resource

  • start_date_label – Label for start date field, default start_date

  • end_date_label – Label for end date field, default end_date

  • duration_label – Label for duration field, default duration

  • percent_complete_label – Label for percent complete field, default percent_complete

  • dependencies_label – Label for dependencies field, default dependencies

generate_gviz_json_file(output_file, demultiplexing_pipeline, analysis_pipeline, active_seqrun_igf_id=None)

A wrapper method for writing a gviz json file with project status information

Parameters
  • output_file – A filepath for writing project status

  • analysis_pipeline – Name of the analysis pipeline

  • demultiplexing_pipeline – Name of the demultiplexing pipeline

  • active_seqrun_igf_id – Igf id of the active seqrun, default None

Returns

None
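One Gantt task entry can be sketched from the labels documented above; the function and the derivation of the end date from the per-job duration are assumptions mirroring the documented defaults:

```python
import datetime


def seqrun_task_row(seqrun_igf_id, start_date, seqrun_work_day=2,
                    resource='Sequencing', percent_complete=100):
    """Build one Gantt task entry for a sequencing run using the
    field labels documented above; the end date is derived from the
    configured per-job duration in days."""
    return {
        'task_id': seqrun_igf_id,
        'task_name': seqrun_igf_id,
        'resource': resource,
        'start_date': start_date,
        'end_date': start_date + datetime.timedelta(days=seqrun_work_day),
        'duration': None,
        'percent_complete': percent_complete,
        'dependencies': None}


task = seqrun_task_row('200101_K00001_0001_AHXXXXXXXX',
                       datetime.date(2020, 1, 1))
```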

get_analysis_info(analysis_pipeline)

A method for fetching all active experiments and their run status for a project

Parameters

analysis_pipeline – Name of the analysis pipeline

Returns

A list of dictionaries containing the analysis information

get_seqrun_info(active_seqrun_igf_id=None, demultiplexing_pipeline=None)

A method for fetching all active sequencing runs for a project

Parameters
  • active_seqrun_igf_id – Seqrun igf id for the current run, default None

  • demultiplexing_pipeline – Name of the demultiplexing pipeline, default None

Returns

A dictionary containing seqrun information

static get_status_column_order()

A method for fetching column order for status json data

Returns

A list containing the column order

static get_status_description()

A method for getting description for status json data

Returns

A dictionary containing status info

Generate data for QC analysis page

class igf_data.utils.project_analysis_utils.Project_analysis(igf_session_class, collection_type_list, remote_analysis_dir='analysis', use_ephemeral_space=0, attribute_collection_file_type='ANALYSIS_CRAM', pipeline_name='PrimaryAnalysisCombined', pipeline_seed_table='experiment', pipeline_finished_status='FINISHED', sample_id_label='SAMPLE_ID')

A class for fetching all the analysis files linked to a project

Parameters
  • igf_session_class – A database session class

  • collection_type_list – A list of collection type for database lookup

  • remote_analysis_dir – A remote path prefix for analysis file look up, default analysis

  • attribute_collection_file_type – A filetype list for fetching collection attribute records, default (‘ANALYSIS_CRAM’)

get_analysis_data_for_project(project_igf_id, output_file, chart_json_output_file=None, csv_output_file=None, gviz_out=True, file_path_column='file_path', type_column='type', sample_igf_id_column='sample_igf_id')

A method for fetching all the analysis files for a project

Parameters
  • project_igf_id – A project igf id for database lookup

  • output_file – An output filepath, either a csv or a gviz json

  • gviz_out – A toggle for converting output to gviz output, default is True

  • sample_igf_id_column – A column name for sample igf id, default sample_igf_id

  • file_path_column – A column name for file path, default file_path

  • type_column – A column name for collection type, default type