cellmaps_imagedownloader package

cellmaps_imagedownloader.runner module

class cellmaps_imagedownloader.runner.CM4AICopyDownloader[source]

Bases: ImageDownloader

Copies over images from CM4AI RO-Crate

Constructor

download_images(download_list=None)[source]

Subclasses should implement

Parameters:

download_list (list) – list of tuples where first element is full URL of image to download and 2nd element is destination path

Returns:

class cellmaps_imagedownloader.runner.CellmapsImageDownloader(outdir=None, imgsuffix='.jpg', imagedownloader=<cellmaps_imagedownloader.runner.MultiProcessImageDownloader object>, imagegen=None, imageurlgen=None, skip_logging=True, provenance=None, input_data_dict=None, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>, skip_failed=False, existing_outdir=False)[source]

Bases: object

Downloads immunofluorescence images from the Human Protein Atlas, storing them in an output directory that is locally registered as an RO-Crate

Constructor

Parameters:
  • outdir (str) – directory where images will be downloaded to

  • imgsuffix (str) – suffix to append to image file names

  • imagedownloader (ImageDownloader) – object that will perform image downloads

  • imagegen (ImageGeneNodeAttributeGenerator) – gene node attribute generator for IF image data

  • image_url (str) – Base URL for image download from Human Protein Atlas

  • skip_logging (bool) – If True skip logging, if None or False do NOT skip logging

  • provenance (dict)

  • input_data_dict (dict)

  • provenance_utils (ProvenanceUtil) – Wrapper for fairscape-cli which is used for RO-Crate creation and population

IMG_SUFFIX = '.jpg'
SAMPLES_FILEKEY = 'samples'
UNIQUE_FILEKEY = 'unique'
generate_readme()[source]

static get_example_provenance(requiredonly=True, with_ids=False)[source]

Gets a dict of provenance parameters needed to add/register a dataset with FAIRSCAPE

Parameters:
  • requiredonly (bool) – If True only output required fields, otherwise output all fields. This value is ignored if with_ids is True

  • with_ids (bool) – If True only output the fields to set dataset guids and ignore value of requiredonly parameter.

Returns:

get_image_gene_node_attributes_file(fold)[source]

Gets full path to image gene node attribute file under output directory created when invoking run()

Returns:

Path to file

Return type:

str

get_image_gene_node_errors_file()[source]

Gets full path to image gene node attribute errors file under output directory created when invoking run()

Returns:

Path to file

Return type:

str

run()[source]

Downloads images to output directory specified in constructor using tsvfile for list of images to download

Raises:

CellMapsImageDownloaderError – If there is an error

Returns:

0 upon success, otherwise failure

class cellmaps_imagedownloader.runner.FakeImageDownloader[source]

Bases: ImageDownloader

Creates fake download by downloading the first image in each color from Human Protein Atlas and making renamed copies. The download_file() function is used to download the first image of each color

Constructor

download_images(download_list=None)[source]

Downloads the first image from the server and then makes renamed copies for subsequent images

Parameters:

download_list (list of tuple)

Returns:

class cellmaps_imagedownloader.runner.ImageDownloader[source]

Bases: object

Abstract class that defines interface for classes that download images
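
As a rough illustration of the interface, the sketch below defines a subclass that copies local files rather than downloading them. The ImageDownloader class here is only a stand-in so the example runs without the package installed, LocalCopyDownloader is a hypothetical name, and returning a list of failure tuples is an assumption borrowed from the return format documented for MultiProcessImageDownloader.

```python
import os
import shutil
import tempfile


class ImageDownloader(object):
    """Stand-in for the abstract base class documented here."""

    def download_images(self, download_list=None):
        raise NotImplementedError('Subclasses should implement')


class LocalCopyDownloader(ImageDownloader):
    """Hypothetical subclass that copies local files instead of downloading."""

    def download_images(self, download_list=None):
        failed = []
        for src, dest in download_list or []:
            try:
                shutil.copyfile(src, dest)
            except OSError as e:
                # Assumed failure shape: (status code, error text, (link, destfile))
                failed.append((-1, str(e), (src, dest)))
        return failed


# Set up one real source file and one missing source file
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'src_red.jpg')
with open(src, 'wb') as f:
    f.write(b'fake image bytes')

copied = os.path.join(tmpdir, 'copy_red.jpg')
failed = LocalCopyDownloader().download_images(
    download_list=[(src, copied),
                   (os.path.join(tmpdir, 'missing.jpg'),
                    os.path.join(tmpdir, 'copy2.jpg'))])
```

Only the missing source file ends up in failed; the first entry is copied as requested.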

download_images(download_list=None)[source]

Subclasses should implement

Parameters:

download_list (list) – list of tuples where first element is full URL of image to download and 2nd element is destination path

Returns:

class cellmaps_imagedownloader.runner.MultiProcessImageDownloader(poolsize=4, skip_existing=False, override_dfunc=None)[source]

Bases: ImageDownloader

Uses multiprocess package to download images in parallel

Constructor

Warning

Exceeding poolsize of 4 causes errors from Human Protein Atlas site

Parameters:
  • poolsize (int) – Number of concurrent downloaders to use.

  • skip_existing (bool) – If True skip download if image file exists and has size greater than 0

  • override_dfunc (function) – Function that takes a tuple (image URL, download str path) and downloads the image. If None download_file() function is used

POOL_SIZE = 4

download_images(download_list=None)[source]

Downloads images returning a list of failed downloads

from cellmaps_imagedownloader.runner import MultiProcessImageDownloader

dloader = MultiProcessImageDownloader(poolsize=2)

d_list = [('https://images.proteinatlas.org/992/1_A1_1_red.jpg',
           '/tmp/1_A1_1_red.jpg')]
failed = dloader.download_images(download_list=d_list)

Parameters:

download_list (list of tuple) – Each tuple of format (image URL, dest file path)

Returns:

Failed downloads, each a tuple of (HTTP status code, text of error, (link, destfile))

Return type:

list of tuple
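
The failure tuples can be unpacked to build a retry list. The following is a minimal sketch that relies only on the tuple format described above; split_failures is a hypothetical helper, not part of the package, and the failure entry is hand-written sample data.

```python
def split_failures(failed):
    """Return (retry_list, messages) from a list of failure tuples.

    Each failure is assumed to be
    (http status code, text of error, (link, destfile)).
    """
    retry_list = []
    messages = []
    for status_code, error_text, (link, destfile) in failed:
        # The inner tuple already has the shape download_images() expects
        retry_list.append((link, destfile))
        messages.append('{} {}: {}'.format(status_code, link, error_text))
    return retry_list, messages


# Hand-written sample failure, not real output
failed = [(503, 'Service Unavailable',
           ('https://images.proteinatlas.org/992/1_A1_1_red.jpg',
            '/tmp/1_A1_1_red.jpg'))]
retry_list, messages = split_failures(failed)
```

The resulting retry_list could then be passed back as download_list in a second call.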

cellmaps_imagedownloader.runner.download_file(downloadtuple)[source]

Downloads file pointed to by ‘download_url’ to ‘destfile’

Note

Default download function used by MultiProcessImageDownloader

Parameters:

downloadtuple (tuple) – (download link, dest file path)

Returns:

None upon success otherwise: (requests status code, text from request, downloadtuple)

Return type:

tuple

cellmaps_imagedownloader.runner.download_file_skip_existing(downloadtuple)[source]

Downloads file in downloadtuple unless the file already exists with a size greater than 0 bytes, in which case the function just returns

Parameters:

downloadtuple (tuple) – (download link, dest file path)

Returns:

None upon success otherwise: (requests status code, text from request, downloadtuple)

Return type:

tuple
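
The skip-if-exists behavior described above can be sketched as follows. This is not the package's implementation: download_unless_exists and fake_download are hypothetical names, and the real function presumably performs the download itself rather than taking a download function as an argument.

```python
import os
import tempfile


def download_unless_exists(downloadtuple, dfunc):
    """Call dfunc(downloadtuple) unless the destination is a non-empty file."""
    _, destfile = downloadtuple
    if os.path.isfile(destfile) and os.path.getsize(destfile) > 0:
        # Existing non-empty file: treat as success, same as a good download
        return None
    return dfunc(downloadtuple)


calls = []


def fake_download(downloadtuple):
    """Stand-in downloader that writes a few bytes instead of using the network."""
    calls.append(downloadtuple)
    with open(downloadtuple[1], 'wb') as f:
        f.write(b'data')
    return None


dest = os.path.join(tempfile.mkdtemp(), '1_A1_1_red.jpg')
item = ('https://images.proteinatlas.org/992/1_A1_1_red.jpg', dest)
first = download_unless_exists(item, fake_download)   # file absent: downloads
second = download_unless_exists(item, fake_download)  # file present: skipped
```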

cellmaps_imagedownloader.gene module

class cellmaps_imagedownloader.gene.CM4AITableConverter(cm4ai=None, fileprefix='B2AI_1_', cell_line='MDA-MB-468')[source]

Bases: object

Converts CM4AI table in an RO-Crate to samples and unique lists compatible with ImageGeneNodeAttributeGenerator

Constructor

Parameters:

cm4ai (str) – Path to CM4AI RO-Crate, or CM4AI RO-Crate antibody_gene_table or URL where CM4AI RO-Crate can be downloaded

get_samples_and_unique_lists()[source]

Gets samples and unique list compatible with ImageGeneNodeAttributeGenerator

Returns:

(samples list, unique list)

Return type:

tuple

class cellmaps_imagedownloader.gene.GeneNodeAttributeGenerator[source]

Bases: object

Base class for GeneNodeAttribute Generator

Constructor

get_gene_node_attributes()[source]

Should be implemented by subclasses

Raises:

NotImplementedError – Always

class cellmaps_imagedownloader.gene.GeneQuery(mygeneinfo=<mygene.MyGeneInfo object>)[source]

Bases: object

Gets information about genes from mygene

Constructor

get_symbols_for_genes(genelist=None, scopes='_id')[source]

Queries for genes via GeneQuery() object passed in via constructor

Parameters:
  • genelist (list) – genes to query for valid symbols and ensembl ids

  • scopes (str) – field to query on: _id for gene id, ensembl.gene for Ensembl IDs

Returns:

result from mygene which is a list of dict objects where each dict is of format:

{ 'query': 'ID',
  '_id': 'ID', '_score': #.##,
  'ensembl': { 'gene': 'ENSEMBLEID' },
  'symbol': 'GENESYMBOL' }

Return type:

list
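
Given the result format above, a lookup from query to gene symbol can be built with a few lines. This is a sketch only: symbols_by_query is a hypothetical helper, the sample data is hand-written, and the entry without a 'symbol' key stands in for a query that found no match (an assumption about how misses are reported).

```python
def symbols_by_query(results):
    """Map each 'query' value to its 'symbol', skipping entries without one."""
    mapping = {}
    for entry in results:
        if 'symbol' in entry:
            mapping[entry['query']] = entry['symbol']
    return mapping


# Hand-written sample data in the documented format
results = [{'query': 'ENSG00000066455',
            '_id': 'ID1', '_score': 1.0,
            'ensembl': {'gene': 'ENSG00000066455'},
            'symbol': 'GOLGA5'},
           {'query': 'NOT_A_GENE', 'notfound': True}]
symbol_map = symbols_by_query(results)
```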

querymany(queries, species=None, scopes=None, fields=None)[source]

Simple wrapper that calls MyGene querymany returning the results

Parameters:
  • queries (list) – list of gene ids/symbols to query

  • species (str)

  • scopes (str)

  • fields (list)

Returns:

dict from MyGene usually in format of

Return type:

list

class cellmaps_imagedownloader.gene.ImageGeneNodeAttributeGenerator(samples_list=None, unique_list=None, genequery=<cellmaps_imagedownloader.gene.GeneQuery object>)[source]

Bases: GeneNodeAttributeGenerator

Creates Image Gene Node Attributes table

Constructor

samples_list is expected to be a list of dict objects with this format:

# TODO: Move this to a separate data document

{
 'filename': HPA FILENAME,
 'if_plate_id': HPA PLATE ID,
 'position': POSITION,
 'sample': SAMPLE,
 'locations': COMMA DELIMITED LOCATIONS,
 'antibody': ANTIBODY_ID,
 'ensembl_ids': COMMA DELIMITED ENSEMBL IDS,
 'gene_names': COMMA DELIMITED GENE SYMBOLS
}

Example:

{
 'filename': '/archive/1/1_A1_1_',
 'if_plate_id': '1',
 'position': 'A1',
 'sample': '1',
 'locations': 'Golgi apparatus',
 'antibody': 'HPA000992',
 'ensembl_ids': 'ENSG00000066455',
 'gene_names': 'GOLGA5'
}

unique_list is expected to be a list of dict objects with this format:

{
 'antibody': ANTIBODY,
 'ensembl_ids': COMMA DELIMITED ENSEMBL IDS,
 'gene_names': COMMA DELIMITED GENE SYMBOLS,
 'atlas_name': ATLAS NAME?,
 'locations': COMMA DELIMITED LOCATIONS IN CELL,
 'n_location': NUMBER OF LOCATIONS IN CELL,
 }

Example:

{
 'antibody': 'HPA040086',
 'ensembl_ids': 'ENSG00000094914',
 'gene_names': 'AAAS',
 'atlas_name': 'U-2',
 'locations': 'OS,Nuclear membrane',
 'n_location': '2',
 }

Parameters:
  • samples_list (list) – List of samples

  • unique_list (list) – List of unique samples

  • genequery (GeneQuery) – Object to query for updated gene symbols

LINKPREFIX_HEADER = 'linkprefix'
SAMPLES_HEADER_COLS = ['filename', 'if_plate_id', 'position', 'sample', 'locations', 'antibody', 'ensembl_ids', 'gene_names']

Column labels for samples file

UNIQUE_HEADER_COLS = ['antibody', 'ensembl_ids', 'gene_names', 'atlas_name', 'locations', 'n_location']

Column labels for unique file

filter_samples_by_sample_urlmap(sample_url_map)[source]

Removes samples that lack a URL as noted in sample_url_map passed in.

Raises:

CellMapsImageDownloaderError – if internal samples list is None

Parameters:

sample_url_map (dict) – map where key is image id and value is URL

get_dicts_of_gene_to_antibody_filename()[source]

Gets a tuple of dictionaries from the sample list passed in via the constructor.

Returns:

(dict of ensembl_id => antibody, dict of antibody => filename, dict of antibody => comma delimited ambiguous ensembl_ids)

Return type:

tuple

get_gene_node_attributes(fold=1)[source]

Using samples_list and unique_list, builds a list of dict objects with updated Gene Symbols.

Format of each resulting dict:

{'name': GENE_SYMBOL,
 'represents': ENSEMBL_ID,
 'ambiguous': AMBIGUOUS_GENES,
 'antibody': ANTIBODY,
 'filename': FILENAME}

Example

{'ENSG00000066455': {'name': 'GOLGA5',
                     'represents': 'ensembl:ENSG00000066455',
                     'ambiguous': '',
                     'antibody': 'HPA000992',
                     'filename': '1_A1_2_,1_A1_1_'}}
Returns:

(list of dict, list of errors)

Return type:

tuple

static get_image_id_for_sample(sample)[source]

Gets image id for sample passed in

Parameters:

sample (dict) –

Assumed to be a dict of following format:

{'antibody': 'HPA0####',
 'position': 'XXX',
 'sample': 'XXX',
 'if_plate_id': 'XXX'}

Raises:

CellMapsImageDownloaderError – If sample is None, is not a dict, or is missing any of these keys: antibody, position, sample, if_plate_id

Returns:

<ANTIBODY WITH HPA0*|CAB0* REMOVED>/<IF_PLATE_ID>_<POSITION>_<SAMPLE>_

Return type:

str
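
The image id format above can be reproduced with a short function. This is a sketch of the documented format rather than the package's code: image_id_for_sample is a hypothetical name, the prefix stripping is one reading of "HPA0*|CAB0* removed" (the HPA or CAB prefix plus any zeros that follow), and ValueError stands in for CellMapsImageDownloaderError so the example has no dependencies.

```python
import re


def image_id_for_sample(sample):
    """Build <ANTIBODY>/<IF_PLATE_ID>_<POSITION>_<SAMPLE>_ for a sample dict."""
    required = ('antibody', 'position', 'sample', 'if_plate_id')
    if not isinstance(sample, dict):
        raise ValueError('sample must be a dict')
    missing = [key for key in required if key not in sample]
    if missing:
        raise ValueError('sample is missing keys: ' + ', '.join(missing))
    # Drop a leading HPA or CAB and the zeros right after it
    antibody = re.sub(r'^(HPA|CAB)0*', '', sample['antibody'])
    return '{}/{}_{}_{}_'.format(antibody, sample['if_plate_id'],
                                 sample['position'], sample['sample'])


image_id = image_id_for_sample({'antibody': 'HPA000992',
                                'position': 'A1',
                                'sample': '1',
                                'if_plate_id': '1'})
# image_id is '992/1_A1_1_'
```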

static get_samples_from_csvfile(csvfile=None)[source]

Loads samples from a CSV file into a list of dictionaries.

Parameters:

csvfile (str) – Path to the CSV file to read samples from.

Returns:

A list of dictionaries, where each dictionary represents a sample extracted from the CSV file.

Return type:

list
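
A round trip between a samples list and a CSV file can be sketched with the standard csv module. This assumes a comma-delimited file whose header row matches SAMPLES_HEADER_COLS; write_samples and read_samples are hypothetical stand-ins, not the package's methods.

```python
import csv
import os
import tempfile

SAMPLES_HEADER_COLS = ['filename', 'if_plate_id', 'position', 'sample',
                       'locations', 'antibody', 'ensembl_ids', 'gene_names']


def write_samples(samples, csvfile):
    """Write a list of sample dicts with a header row."""
    with open(csvfile, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=SAMPLES_HEADER_COLS)
        writer.writeheader()
        writer.writerows(samples)


def read_samples(csvfile):
    """Read each CSV row into a dict keyed by the header row."""
    with open(csvfile, 'r', newline='') as f:
        return [dict(row) for row in csv.DictReader(f)]


original = [{'filename': '/archive/1/1_A1_1_',
             'if_plate_id': '1',
             'position': 'A1',
             'sample': '1',
             'locations': 'Golgi apparatus',
             'antibody': 'HPA000992',
             'ensembl_ids': 'ENSG00000066455',
             'gene_names': 'GOLGA5'}]
csvfile = os.path.join(tempfile.mkdtemp(), 'samples.csv')
write_samples(original, csvfile)
samples = read_samples(csvfile)
```

All values come back as strings, which matches the string-valued example dicts shown for samples_list.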

get_samples_list()[source]

Gets samples_list passed in via the constructor that has been filtered by unique_list passed in via the constructor

Returns:

list of samples set via constructor

Return type:

list

get_samples_list_image_ids()[source]

Gets a list of image ids from the samples set via constructor

Raises:

CellMapsImageDownloaderError – if samples list in constructor is None or if there was an issue parsing a sample

Returns:

image ids

Return type:

list

get_unique_list()[source]

Gets unique_list passed in via the constructor

Returns:

static get_unique_list_from_csvfile(csvfile=None)[source]

Parameters:

csvfile

Returns:

write_samples_to_csvfile(csvfile=None)[source]

Writes samples to file

Parameters:

csvfile (str) – path to file to write

write_unique_list_to_csvfile(csvfile=None)[source]

Writes unique list to file

Parameters:

csvfile (str) – path to file to write

cellmaps_imagedownloader.proteinatlas module

class cellmaps_imagedownloader.proteinatlas.CM4AIImageCopyTupleGenerator(samples_list=None)[source]

Bases: object

Gets URL to download images for given samples

Parameters:

samples_list

get_next_image_url(color_download_map=None)[source]

Parameters:

color_download_map – dict of colors to location on filesystem {'red': '/tmp/foo/red'}

Returns:

list of tuples (image download URL, destination file path)

Return type:

list

get_sample_urlmap()[source]

Gets map of ANTIBODY/PLATE_ID_POSITION_SAMPLE_ => download url of _blue_red_green.jpg

Returns:

map or None

Return type:

dict

class cellmaps_imagedownloader.proteinatlas.ImageDownloadTupleGenerator(samples_list=None, reader=None, valid_image_ids=None)[source]

Bases: object

Gets URL to download images for given samples

Constructor

Parameters:
  • samples_list (list)

  • reader (ProteinAtlasImageUrlReader) – Used to get download URLs for images

  • valid_image_ids (set) – Image ids that need a download URL in format of <ANTIBODY ID minus HPA or CAB prefix>/<IMAGE ID>

get_next_image_url(color_download_map=None)[source]

Generator function that gets the next image URL to download

Parameters:

color_download_map – dict of colors to location on filesystem {'red': '/tmp/foo/red'}

Returns:

list of tuples (image download URL, destination file path)

Return type:

list
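
One way to picture the returned tuples is sketched below. It is built on assumptions, not on the package's code: that the merged image URL ends in blue_red_green.jpg (as in get_sample_urlmap()), that each single-color image shares the same URL prefix, and that destination files keep the image's base name. tuples_for_image is a hypothetical helper.

```python
import os


def tuples_for_image(merged_url, color_download_map, imgsuffix='.jpg'):
    """Return [(image download URL, destination file path), ...] per color."""
    merged_tail = 'blue_red_green' + imgsuffix
    if not merged_url.endswith(merged_tail):
        raise ValueError('unexpected image URL: ' + merged_url)
    # e.g. https://images.proteinatlas.org/992/1_A1_1_
    url_prefix = merged_url[:-len(merged_tail)]
    # e.g. 1_A1_1_
    name_prefix = url_prefix.rsplit('/', 1)[-1]
    return [(url_prefix + color + imgsuffix,
             os.path.join(destdir, name_prefix + color + imgsuffix))
            for color, destdir in color_download_map.items()]


tuples = tuples_for_image(
    'https://images.proteinatlas.org/992/1_A1_1_blue_red_green.jpg',
    {'red': '/tmp/foo/red'})
```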

get_sample_urlmap()[source]

Gets map of ANTIBODY/PLATE_ID_POSITION_SAMPLE_ => download url of _blue_red_green.jpg

Returns:

map or None

Return type:

dict

class cellmaps_imagedownloader.proteinatlas.LinkPrefixImageDownloadTupleGenerator(samples_list=None)[source]

Bases: object

Gets URL to download images for given samples

Parameters:

samples_list

get_next_image_url(color_download_map=None)[source]

Parameters:

color_download_map – dict of colors to location on filesystem {'red': '/tmp/foo/red'}

Returns:

list of tuples (image download URL, destination file path)

Return type:

list

get_sample_urlmap()[source]

Gets map of ANTIBODY/PLATE_ID_POSITION_SAMPLE_ => download url of _blue_red_green.jpg

Returns:

map or None

Return type:

dict

class cellmaps_imagedownloader.proteinatlas.ProteinAtlasImageUrlReader(reader=None)[source]

Bases: object

Takes a proteinatlas generator to get value between <imageUrl>XXX</imageUrl> lines with the keyword _blue in them

Constructor

Parameters:

reader (ProteinAtlasReader)

get_next_image_id_and_url()[source]

Returns:

(image id, image_url)

Return type:

tuple
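
The extraction described for this class can be approximated with a regular expression over lines of XML. The sketch below is illustrative only: image_ids_and_urls is a hypothetical generator, the sample lines are hand-written, and deriving the image id from the last two URL path components is an assumption based on the id format used elsewhere in this package.

```python
import re

IMAGE_URL_RE = re.compile(r'<imageUrl>(.*?)</imageUrl>')


def image_ids_and_urls(lines):
    """Yield (image id, image_url) for imageUrl lines containing _blue."""
    for line in lines:
        if '_blue' not in line:
            continue
        match = IMAGE_URL_RE.search(line)
        if match is None:
            continue
        url = match.group(1)
        # Assumed URL layout: .../<antibody>/<plate>_<position>_<sample>_blue...
        antibody_dir, filename = url.split('/')[-2:]
        image_id = antibody_dir + '/' + filename[:filename.index('blue')]
        yield image_id, url


# Hand-written sample lines, not taken from a real proteinatlas.xml
lines = ['<name>GOLGA5</name>',
         '  <imageUrl>https://images.proteinatlas.org/992/1_A1_1_blue_red_green.jpg</imageUrl>',
         '  <imageUrl>https://images.proteinatlas.org/992/other.jpg</imageUrl>']
found = list(image_ids_and_urls(lines))
```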

class cellmaps_imagedownloader.proteinatlas.ProteinAtlasProcessor(outdir=None, proteinatlas=None, proteinlist_file=None, cell_line=None)[source]

Bases: object

get_sample_list_from_hpa()[source]

class cellmaps_imagedownloader.proteinatlas.ProteinAtlasReader(outdir=None, proteinatlas=None)[source]

Bases: object

Returns contents of proteinatlas.xml file one line at a time

Constructor

Parameters:
  • outdir (str) – Path to directory where results can be written to

  • proteinatlas (str) – URL or path to proteinatlas.xml or proteinatlas.xml.gz file

DEFAULT_PROTEINATLAS_URL = 'https://www.proteinatlas.org/download/proteinatlas.xml.gz'
readline()[source]

Generator that returns next line of proteinatlas data set via constructor

Returns:

next line of file

Return type:

str
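
Reading either a plain or gzipped file one line at a time can be sketched with a small generator. This covers only the local-path case and is not the package's implementation; read_lines is a hypothetical name, and choosing the opener by the .gz extension is an assumption.

```python
import gzip
import os
import tempfile


def read_lines(path):
    """Yield lines from a text file, transparently handling .gz files."""
    opener = gzip.open if path.endswith('.gz') else open
    # 'rt' gives text mode for both open() and gzip.open()
    with opener(path, 'rt') as f:
        for line in f:
            yield line


# Write a tiny gzipped sample file and read it back
gz_path = os.path.join(tempfile.mkdtemp(), 'proteinatlas.xml.gz')
with gzip.open(gz_path, 'wt') as f:
    f.write('<proteinAtlas>\n</proteinAtlas>\n')

lines_read = list(read_lines(gz_path))
```

Because the function is a generator, large files are never held in memory all at once.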

cellmaps_imagedownloader.proteinatlas.download_proteinalas_file(outdir, proteinatlas, max_retries=3, retry_wait=10)[source]

cellmaps_imagedownloader.cellmaps_imagedownloadercmd module

cellmaps_imagedownloader.cellmaps_imagedownloadercmd.main(args)[source]

Main entry point for program

Parameters:

args (list) – arguments passed to command line, usually sys.argv[1:]

Returns:

return value of cellmaps_imagedownloader.runner.CellmapsImageDownloader.run() or 2 if an exception is raised

Return type:

int

cellmaps_imagedownloader.exceptions module

exception cellmaps_imagedownloader.exceptions.CellMapsImageDownloaderError[source]

Bases: Exception

Base exception for CellMapsImageDownloader

Module contents

Top-level package for cellmaps_imagedownloader.