Module: `Cell_BLAST.blast`

Cell BLAST based on DIRECTi models

Classes:

`BLAST`(models, ref[, distance_metric, ...])	Cell BLAST
`Hits`(blast, hits, dist, pval, query)	BLAST hits

Functions:

`amd`(x, y, x_posterior, y_posterior[, eps, ...])	x, y : latent_dim; x_posterior, y_posterior : n_posterior * latent_dim
`ed`(x, y)	x, y : latent_dim
`md`(x, y, x_posterior, y_posterior)	x, y : latent_dim; x_posterior, y_posterior : n_posterior * latent_dim
`npd_v1`(x, y, x_posterior, y_posterior[, eps])	x, y : latent_dim; x_posterior, y_posterior : n_posterior * latent_dim
`npd_v2`(x, y, x_posterior, y_posterior[, eps])	x, y : latent_dim; x_posterior, y_posterior : n_posterior * latent_dim
`sankey`(query, ref[, title, width, height, ...])	Make a sankey diagram of query-reference mapping (only works in ipython notebooks).

class Cell_BLAST.blast.BLAST(models, ref, distance_metric='npd_v1', n_posterior=50, n_empirical=10000, cluster_empirical=False, eps=None, force_components=True, **kwargs)[source]

Cell BLAST

Parameters:

models (List[DIRECTi]) – A list of “DIRECTi” models.
ref (AnnData) – A reference dataset.
distance_metric (str) – Cell-to-cell distance metric to use, should be among {“npd_v1”, “npd_v2”, “md”, “amd”, “ed”}.
n_posterior (int) – How many samples from the posterior distribution to use for estimating posterior distance. Irrelevant for distance_metric=“ed”.
n_empirical (int) – Number of random cell pairs to use when estimating empirical distribution of cell-to-cell distance.
cluster_empirical (bool) – Whether to build an empirical distribution for each intrinsic cluster independently.
eps (Optional[float]) – A small number added to the normalization factors used in certain posterior-based distance metrics to improve numeric stability. If not specified, a recommended value will be used according to the specified distance metric.
force_components (bool) – Whether to compute all the necessary components upon initialization. If set to False, necessary components will be computed on the fly when performing queries.
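To make the role of n_empirical more concrete, the sketch below ranks an observed distance against a sample of distances between random cell pairs. It illustrates the general idea of an empirical p-value under stated assumptions and is not Cell BLAST's actual implementation; the function name `empirical_pvalue` and the toy numbers are hypothetical.

```python
from bisect import bisect_right


def empirical_pvalue(observed, null_distances):
    """Fraction of null (random-pair) distances that are <= the observed distance.

    Hypothetical helper for illustration only; Cell BLAST may compute
    its empirical p-values differently.
    """
    null_sorted = sorted(null_distances)
    # bisect_right returns how many sorted null distances are <= observed
    n_le = bisect_right(null_sorted, observed)
    return n_le / len(null_sorted)


# Toy null distribution: distances between random cell pairs
null = [0.9, 1.4, 2.0, 2.3, 2.7, 3.1, 3.6, 4.0, 4.4, 5.2]

p_close = empirical_pvalue(0.5, null)  # closer than every random pair
p_far = empirical_pvalue(3.0, null)    # farther than half of the random pairs
```

A larger sample of random pairs gives a finer-grained p-value, since the smallest nonzero value this estimate can take is one over the sample size.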

Examples

A typical BLAST pipeline is described below.

Assuming we have a list of directi.DIRECTi models already fitted on some reference data, we can construct a BLAST object by feeding the pretrained models and the reference data to the BLAST constructor.

>>> blast = BLAST(models, reference)

We can efficiently query the reference and obtain initial hits via the BLAST.query() method:

>>> hits = blast.query(query)

Then we filter the initial hits using more accurate metrics (e.g. empirical p-value based on posterior distance), pooling together information across multiple models.

>>> hits = hits.reconcile_models().filter(by="pval", cutoff=0.05)

Finally, we use the Hits.annotate() method to obtain predictions based on reference annotations, e.g. “cell_ontology_class” in this case.

>>> annotation = hits.annotate("cell_ontology_class")

See the BLAST ipython notebook (Vignettes) for live examples.

Methods:

`align`(query[, n_jobs, random_seed, path])	Align internal DIRECTi models with query datasets (fine tuning).
`load`(path[, mode])	Load BLAST object from a directory.
`query`(query[, n_neighbors, store_dataset, ...])	BLAST query
`save`(path[, only_used_genes])	Save BLAST object to a directory.

align(query, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__', path=None, **kwargs)[source]

Align internal DIRECTi models with query datasets (fine tuning).

Parameters:

query (Union[AnnData, Mapping[str, AnnData]]) – A query dataset or a dict of query datasets, which will be aligned to the reference.
n_jobs (int) – Number of parallel jobs to run when building the BLAST index. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.
random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.
path (Optional[str]) – Specifies a path to store temporary files.
kwargs – Additional keyword parameters passed to directi.align_DIRECTi().

Returns:

A new BLAST object with aligned internal models.

Return type:

blast
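Because align() returns a new BLAST object rather than modifying the existing one, the aligned object is the one to query afterwards. The helper below is a minimal sketch of that pattern; `align_then_query` and its `align_kwargs` argument are hypothetical names, and `blast` is assumed to be a BLAST object for which fine-tuning is available (i.e. not loaded in MINIMAL mode).

```python
def align_then_query(blast, query, align_kwargs=None, **query_kwargs):
    """Fine-tune the internal models on `query`, then query with the aligned object.

    Hypothetical convenience wrapper; it only relies on the documented
    BLAST.align() and BLAST.query() methods.
    """
    # align() returns a *new* BLAST object; the original is left untouched
    aligned = blast.align(query, **(align_kwargs or {}))
    # Query with the aligned models, forwarding options such as n_neighbors
    return aligned.query(query, **query_kwargs)
```

Keeping the original object around makes it easy to compare hits before and after alignment.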

classmethod load(path, mode=1, **kwargs)[source]

Load BLAST object from a directory.

Parameters:

path (str) – Specifies a path to load from.
mode (int) – If mode is set to MINIMAL, model loading will be accelerated by only loading the encoders, but aligning BLAST (fine-tuning) would not be available. Should be among {cb.blast.NORMAL, cb.blast.MINIMAL}

Returns:

Loaded BLAST object.

Return type:

blast

query(query, n_neighbors=5, store_dataset=False, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__')[source]

BLAST query

Parameters:

query (AnnData) – Query transcriptomes.
n_neighbors (int) – Initial number of nearest neighbors to search in each model.
store_dataset (bool) – Whether to store query dataset in the returned hit object. Note that this is necessary if Hits.gene_gradient() is to be used.
n_jobs (int) – Number of parallel jobs to run when performing query. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.
random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.

Returns:

Query hits

Return type:

hits

save(path, only_used_genes=True)[source]

Save BLAST object to a directory.

Parameters:

path (str) – Specifies a path to save the BLAST object.
only_used_genes (bool) – Whether to preserve only the genes used by models.

Return type:

None

class Cell_BLAST.blast.Hits(blast, hits, dist, pval, query)[source]

BLAST hits

Parameters:

blast (BLAST) – The BLAST object producing the hits
hits (List[ndarray]) – Indices of hit cells in the reference dataset. Each list element contains hit cell indices for a query cell.
dist (List[ndarray]) – Hit cell distances. Each list element contains distances for a query cell. Each list element is a \(n\_hits \times n\_models\) matrix, with matrix entries corresponding to the distance to each hit cell under each model.
pval (List[ndarray]) – Hit cell empirical p-values. Each list element contains p-values for a query cell. Each list element is a \(n\_hits \times n\_models\) matrix, with matrix entries corresponding to the empirical p-value of each hit cell under each model.
query (AnnData) – Query dataset

Methods:

`annotate`(field[, min_hits, ...])	Annotate query cells based on existing annotations of hit cells via majority voting.
`blast2co`(cl_dag[, cl_field, min_hits, ...])	Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.
`filter`([by, cutoff, model_tolerance, n_jobs])	Filter hits by posterior distance or p-value
`gene_gradient`([eval_point, ...])	Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space.
`reconcile_models`([dist_method, pval_method])	Integrate model-specific distances and empirical p-values.
`to_data_frames`()	Construct hit data frames for query cells.

annotate(field, min_hits=2, majority_threshold=0.5, return_evidence=False)[source]

Annotate query cells based on existing annotations of hit cells via majority voting.

Parameters:

field (str) – Specifies a meta column in anndata.AnnData.obs.
min_hits (int) – Minimal number of hits required for annotating a query cell, otherwise the query cell will be rejected.
majority_threshold (float) – Minimal majority fraction (not inclusive) required for confident annotation. Only effective when predicting categorical variables. If the threshold is not met, annotation will be “ambiguous”.
return_evidence (bool) – Whether to return evidence level of the annotations.

Returns:

Each row contains the inferred annotation for a query cell. If return_evidence is set to False, the data frame contains only one column, i.e. the inferred annotation. If return_evidence is set to True, the data frame also contains the number of hits, as well as the majority fraction (only for categorical annotations) for each query cell.

Return type:

prediction
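The interaction between min_hits and majority_threshold for categorical annotations can be sketched as follows. This is a simplified illustration of majority voting as described above, not the library's implementation; the function name `majority_vote` and the literal "rejected" label are assumptions.

```python
from collections import Counter


def majority_vote(hit_labels, min_hits=2, majority_threshold=0.5):
    """Vote over the annotations of the hit cells of one query cell."""
    # Too few hits: the query cell is rejected (label string is an assumption)
    if len(hit_labels) < min_hits:
        return "rejected"
    label, count = Counter(hit_labels).most_common(1)[0]
    # The threshold is not inclusive, so the fraction must strictly exceed it
    if count / len(hit_labels) > majority_threshold:
        return label
    return "ambiguous"


confident = majority_vote(["T cell", "T cell", "B cell"])  # 2/3 > 0.5
tied = majority_vote(["T cell", "B cell"])                 # 1/2 is not > 0.5
too_few = majority_vote(["T cell"])                        # fewer than min_hits
```

Note that with the default threshold of 0.5, an even split between two labels yields "ambiguous" because the comparison is strict.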

blast2co(cl_dag, cl_field='cell_ontology_class', min_hits=2, thresh=0.5, min_path=4)[source]

Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.

Parameters:

cl_dag (CellTypeDAG) – Cell ontology DAG
cl_field (str) – Specify the obs column containing cell ontology annotation in the reference dataset
min_hits (int) – Minimal number of hits required for annotating a query cell, otherwise the query cell will be rejected.
thresh (float) – Scoring threshold based on 1 - pvalue.
min_path (int) – Minimal allowed value of the maximal distance to root for a prediction to be made.

Returns:

Each row contains the inferred annotation for a query cell.

Return type:

prediction

filter(by='pval', cutoff=0.05, model_tolerance=0, n_jobs=1)[source]

Filter hits by posterior distance or p-value

Parameters:

by (str) – Specifies a metric based on which to filter hits. Should be among {“dist”, “pval”}.
cutoff (float) – Cutoff when filtering hits.
model_tolerance (int) – Maximal number of models allowed in which the cutoff is not satisfied, above which the query cell will be rejected. Irrelevant for reconciled hits.
n_jobs (int) – Number of parallel jobs to run.

Returns:

Hit object containing remaining hits after filtering

Return type:

filtered_hits
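For hits that have not been reconciled, each hit carries one value per model, and model_tolerance controls how many models may fail the cutoff. The sketch below shows one plausible reading of that rule, applied per hit; it assumes a value satisfies the cutoff when it is less than or equal to it, and `filter_hits` is a hypothetical name.

```python
def filter_hits(rows, cutoff=0.05, model_tolerance=0):
    """Return indices of hits whose per-model values pass the cutoff.

    `rows` holds one list per hit, with one value (distance or p-value)
    per model. Assumption: a model "fails" when its value exceeds the cutoff.
    """
    kept = []
    for idx, row in enumerate(rows):
        n_failed = sum(1 for value in row if value > cutoff)
        # Keep the hit only if at most `model_tolerance` models fail
        if n_failed <= model_tolerance:
            kept.append(idx)
    return kept


# Three hits scored by three models (toy p-values)
pval = [
    [0.01, 0.02, 0.03],  # passes under every model
    [0.01, 0.20, 0.02],  # fails under one model
    [0.30, 0.40, 0.01],  # fails under two models
]
strict = filter_hits(pval)                       # model_tolerance=0
tolerant = filter_hits(pval, model_tolerance=1)  # allow one failing model
```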

gene_gradient(eval_point='query', normalize_deviation=True, avg_models=True, n_jobs='__UsE_gLoBaL__')[source]

Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space. Useful for model interpretation.

Parameters:

eval_point (str) – At which point should the gradient be evaluated. Valid options include: {“query”, “ref”, “both”}
normalize_deviation (bool) – Whether to normalize query-hit deviation in the latent space.
avg_models (bool) – Whether to average gene-wise gradients across different models
n_jobs (int) – Number of parallel jobs to run when performing query. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.

Returns:

A list with length equal to the number of query cells, where each element is a np.ndarray containing the gene-wise gradient for every hit cell of a query cell. Each np.ndarray has shape \(n\_hits \times n\_genes\) if avg_models is set to True, or \(n\_hits \times n\_models \times n\_genes\) if avg_models is set to False.

Return type:

gene_gradient

reconcile_models(dist_method='mean', pval_method='gmean')[source]

Integrate model-specific distances and empirical p-values.

Parameters:

dist_method (str) – Specifies how to integrate distances across different models. Should be among {“mean”, “gmean”, “min”, “max”}.
pval_method (str) – Specifies how to integrate empirical p-values across different models. Should be among {“mean”, “gmean”, “min”, “max”}.

Returns:

Hit object containing reconciled distances and empirical p-values.

Return type:

reconciled_hits
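The four integration options can be illustrated by collapsing each row of an \(n\_hits \times n\_models\) matrix into a single value. This is a minimal sketch using the standard library, assuming "gmean" denotes the geometric mean; the `REDUCERS` table and `reconcile` function are hypothetical names, not part of Cell BLAST.

```python
from statistics import fmean, geometric_mean

# Map each documented method name to a reduction over the model axis
REDUCERS = {
    "mean": fmean,
    "gmean": geometric_mean,  # assumption: "gmean" is the geometric mean
    "min": min,
    "max": max,
}


def reconcile(rows, method):
    """Collapse per-model values (one row per hit) into one value per hit."""
    reduce_fn = REDUCERS[method]
    return [reduce_fn(row) for row in rows]


# Two hits scored by two models (toy p-values)
pval = [[0.01, 0.04], [0.5, 0.5]]

by_mean = reconcile(pval, "mean")    # approximately [0.025, 0.5]
by_gmean = reconcile(pval, "gmean")  # approximately [0.02, 0.5]
by_min = reconcile(pval, "min")
```

The geometric mean is pulled less strongly by a single large p-value than the arithmetic mean, which may be one reason it is the default for pval_method.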

to_data_frames()[source]

Construct hit data frames for query cells. Note that only reconciled Hits objects are supported.

Returns:

Each element is a hit data frame for a cell.

Return type:

data_frame_dicts

Cell_BLAST.blast.amd(x, y, x_posterior, y_posterior, eps=0.1, x_is_pcasd=False, y_is_pcasd=False)[source]

x : latent_dim
y : latent_dim
x_posterior : n_posterior * latent_dim
y_posterior : n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.ed(x, y)[source]

x : latent_dim
y : latent_dim
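Unlike the other metrics, ed takes only the two latent vectors and no posterior samples. Assuming "ed" stands for Euclidean distance (the documentation does not spell this out), the computation on two vectors of length latent_dim would look like the sketch below; `euclidean_distance` is a hypothetical stand-in, not the library function.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two latent vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same latent dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two cells embedded in a 2-dimensional latent space (toy values)
d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
```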

Cell_BLAST.blast.md(x, y, x_posterior, y_posterior)[source]

x : latent_dim
y : latent_dim
x_posterior : n_posterior * latent_dim
y_posterior : n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.npd_v1(x, y, x_posterior, y_posterior, eps=0.0)[source]

x : latent_dim
y : latent_dim
x_posterior : n_posterior * latent_dim
y_posterior : n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.npd_v2(x, y, x_posterior, y_posterior, eps=0.1)[source]

x : latent_dim
y : latent_dim
x_posterior : n_posterior * latent_dim
y_posterior : n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.sankey(query, ref, title='Sankey', width=500, height=500, tint_cutoff=1, font='Arial', font_size=10.0, suppress_plot=False)[source]

Make a sankey diagram of query-reference mapping (only works in ipython notebooks).

Parameters:

query (ndarray) – 1-dimensional array of query annotation.
ref (ndarray) – 1-dimensional array of BLAST prediction based on reference database.
title (str) – Diagram title.
width (int) – Graph width.
height (int) – Graph height.
tint_cutoff (int) – Cutoff below which sankey flows are shown in a tinter color.
font (str) – Font family used for the plot.
font_size (float) – Font size for the plot.
suppress_plot (bool) – Whether to suppress plotting and only return the figure dict.

Returns:

Figure object fed to iplot of the plotly module to produce the plot.

Return type:

fig
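A sankey diagram of this kind is driven by how many cells flow from each query annotation to each predicted annotation. The sketch below shows how such flow counts could be tallied from the two 1-dimensional inputs; it is an illustration of the underlying bookkeeping, not the function's actual code, and `sankey_flows` is a hypothetical name.

```python
from collections import Counter


def sankey_flows(query, ref):
    """Count cells flowing from each query annotation to each prediction."""
    if len(query) != len(ref):
        raise ValueError("query and ref must have the same length")
    # Each (query label, predicted label) pair is one flow in the diagram
    return Counter(zip(query, ref))


# Toy annotations for five cells
query_labels = ["T cell", "T cell", "B cell", "B cell", "B cell"]
predictions = ["T cell", "T cell", "B cell", "B cell", "T cell"]

flows = sankey_flows(query_labels, predictions)
```

Each key in `flows` corresponds to one band in the diagram, with the count determining its width.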
