Module: Cell_BLAST.blast

Cell BLAST based on DIRECTi models

Classes:

BLAST(models, ref[, distance_metric, ...])

Cell BLAST

Hits(blast, hits, dist, pval, query)

BLAST hits

Functions:

amd(x, y, x_posterior, y_posterior[, eps, ...])

Shapes: x: latent_dim; y: latent_dim; x_posterior: n_posterior * latent_dim; y_posterior: n_posterior * latent_dim

ed(x, y)

Shapes: x: latent_dim; y: latent_dim

md(x, y, x_posterior, y_posterior)

Shapes: x: latent_dim; y: latent_dim; x_posterior: n_posterior * latent_dim; y_posterior: n_posterior * latent_dim

npd_v1(x, y, x_posterior, y_posterior[, eps])

Shapes: x: latent_dim; y: latent_dim; x_posterior: n_posterior * latent_dim; y_posterior: n_posterior * latent_dim

npd_v2(x, y, x_posterior, y_posterior[, eps])

Shapes: x: latent_dim; y: latent_dim; x_posterior: n_posterior * latent_dim; y_posterior: n_posterior * latent_dim

sankey(query, ref[, title, width, height, ...])

Make a sankey diagram of query-reference mapping (only works in ipython notebooks).

class Cell_BLAST.blast.BLAST(models, ref, distance_metric='npd_v1', n_posterior=50, n_empirical=10000, cluster_empirical=False, eps=None, force_components=True, **kwargs)

Cell BLAST

Parameters:
  • models (List[DIRECTi]) – A list of “DIRECTi” models.

  • ref (AnnData) – A reference dataset.

  • distance_metric (str) – Cell-to-cell distance metric to use, should be among {“npd_v1”, “npd_v2”, “md”, “amd”, “ed”}.

  • n_posterior (int) – How many samples from the posterior distribution to use for estimating posterior distance. Irrelevant for distance_metric=“ed”.

  • n_empirical (int) – Number of random cell pairs to use when estimating empirical distribution of cell-to-cell distance.

  • cluster_empirical (bool) – Whether to build an empirical distribution for each intrinsic cluster independently.

  • eps (Optional[float]) – A small number added to the normalization factors used in certain posterior-based distance metrics to improve numeric stability. If not specified, a recommended value will be used according to the specified distance metric.

  • force_components (bool) – Whether to compute all the necessary components upon initialization. If set to False, necessary components will be computed on the fly when performing queries.
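
The role of n_empirical can be illustrated with a minimal sketch. It assumes that the empirical p-value of a hit is the fraction of randomly sampled reference cell pairs whose distance is no larger than the observed query-hit distance. The function names, the toy latent vectors and the use of plain Euclidean distance are illustrative assumptions, not the library's implementation.

```python
import bisect
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def empirical_distances(latent, n_empirical, seed=0):
    """Sorted distances between randomly drawn pairs of reference cells."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_empirical):
        i, j = rng.sample(range(len(latent)), 2)  # two distinct cells
        dists.append(euclidean(latent[i], latent[j]))
    return sorted(dists)


def empirical_pval(dist, sorted_dists):
    """Fraction of background distances that are <= the observed distance."""
    return bisect.bisect_right(sorted_dists, dist) / len(sorted_dists)


# Hypothetical 2-dimensional latent coordinates of five reference cells.
reference_latent = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
background = empirical_distances(reference_latent, n_empirical=1000)

print(empirical_pval(0.1, background))   # 0.0: closer than every random pair
print(empirical_pval(10.0, background))  # 1.0: farther than every random pair
```

A larger n_empirical gives a finer-grained background distribution, since the smallest non-zero p-value such a scheme can report is 1 / n_empirical.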

Examples

A typical BLAST pipeline is described below.

Assuming we have a list of directi.DIRECTi models already fitted on some reference data, we can construct a BLAST object by feeding the pretrained models and the reference data to the BLAST constructor.

>>> blast = BLAST(models, reference)

We can efficiently query the reference and obtain initial hits via the BLAST.query() method:

>>> hits = blast.query(query)

Then we filter the initial hits using more accurate metrics (e.g. empirical p-value based on posterior distance), and pool together information across multiple models.

>>> hits = hits.reconcile_models().filter(by="pval", cutoff=0.05)

Finally, we use the Hits.annotate() method to obtain predictions based on reference annotations, e.g. “cell_ontology_class” in this case.

>>> annotation = hits.annotate("cell_ontology_class")

See the BLAST ipython notebook (Vignettes) for live examples.

Methods:

align(query[, n_jobs, random_seed, path])

Align internal DIRECTi models with query datasets (fine tuning).

load(path[, mode])

Load BLAST object from a directory.

query(query[, n_neighbors, store_dataset, ...])

BLAST query

save(path[, only_used_genes])

Save BLAST object to a directory.

align(query, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__', path=None, **kwargs)

Align internal DIRECTi models with query datasets (fine tuning).

Parameters:
  • query (Union[AnnData, Mapping[str, AnnData]]) – A query dataset or a dict of query datasets, which will be aligned to the reference.

  • n_jobs (int) – Number of parallel jobs to run when building the BLAST index. If not specified, config.N_JOBS will be used. Note that each (tensorflow) job could itself be distributed across multiple CPUs.

  • random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.

  • path (Optional[str]) – Specifies a path to store temporary files.

  • kwargs – Additional keyword parameters passed to directi.align_DIRECTi().

Returns:

A new BLAST object with aligned internal models.

Return type:

blast

classmethod load(path, mode=1, **kwargs)

Load BLAST object from a directory.

Parameters:
  • path (str) – Specifies a path to load from.

  • mode (int) – Loading mode, should be among {cb.blast.NORMAL, cb.blast.MINIMAL}. If mode is set to MINIMAL, model loading is accelerated by loading only the encoders, but aligning BLAST (fine-tuning) will not be available.

Returns:

Loaded BLAST object.

Return type:

blast

query(query, n_neighbors=5, store_dataset=False, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__')

BLAST query

Parameters:
  • query (AnnData) – Query transcriptomes.

  • n_neighbors (int) – Initial number of nearest neighbors to search in each model.

  • store_dataset (bool) – Whether to store the query dataset in the returned hit object. Note that this is necessary if Hits.gene_gradient() is to be used.

  • n_jobs (int) – Number of parallel jobs to run when performing the query. If not specified, config.N_JOBS will be used. Note that each (tensorflow) job could itself be distributed across multiple CPUs.

  • random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.

Returns:

Query hits

Return type:

hits

save(path, only_used_genes=True)

Save BLAST object to a directory.

Parameters:
  • path (str) – Specifies a path to save the BLAST object.

  • only_used_genes (bool) – Whether to preserve only the genes used by models.

Return type:

None

class Cell_BLAST.blast.Hits(blast, hits, dist, pval, query)

BLAST hits

Parameters:
  • blast (BLAST) – The BLAST object producing the hits

  • hits (List[ndarray]) – Indices of hit cell in the reference dataset. Each list element contains hit cell indices for a query cell.

  • dist (List[ndarray]) – Hit cell distances. Each list element contains distances for a query cell. Each list element is a \(n\_hits \times n\_models\) matrix, with matrix entries corresponding to the distance to each hit cell under each model.

  • pval (List[ndarray]) – Hit cell empirical p-values. Each list element contains p-values for a query cell. Each list element is a \(n\_hits \times n\_models\) matrix, with matrix entries corresponding to the empirical p-value of each hit cell under each model.

  • query (AnnData) – Query dataset

Methods:

annotate(field[, min_hits, ...])

Annotate query cells based on existing annotations of hit cells via majority voting.

blast2co(cl_dag[, cl_field, min_hits, ...])

Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.

filter([by, cutoff, model_tolerance, n_jobs])

Filter hits by posterior distance or p-value

gene_gradient([eval_point, ...])

Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space.

reconcile_models([dist_method, pval_method])

Integrate model-specific distances and empirical p-values.

to_data_frames()

Construct hit data frames for query cells.

annotate(field, min_hits=2, majority_threshold=0.5, return_evidence=False)

Annotate query cells based on existing annotations of hit cells via majority voting.

Parameters:
  • field (str) – Specifies a meta column in anndata.AnnData.obs.

  • min_hits (int) – Minimal number of hits required for annotating a query cell, otherwise the query cell will be rejected.

  • majority_threshold (float) – Minimal majority fraction (not inclusive) required for confident annotation. Only effective when predicting categorical variables. If the threshold is not met, annotation will be “ambiguous”.

  • return_evidence (bool) – Whether to return evidence level of the annotations.

Returns:

Each row contains the inferred annotation for a query cell. If return_evidence is set to False, the data frame contains only one column, i.e. the inferred annotation. If return_evidence is set to True, the data frame also contains the number of hits, as well as the majority fraction (only for categorical annotations) for each query cell.

Return type:

prediction
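
The interplay of min_hits and majority_threshold for categorical annotations can be sketched as below. This is a minimal illustration of majority voting as described above, not the library's code: the function name majority_vote, the tuple return value and the literal "rejected" label are assumptions, while "ambiguous" follows the parameter description.

```python
from collections import Counter


def majority_vote(hit_labels, min_hits=2, majority_threshold=0.5):
    """Return (annotation, n_hits, majority_fraction) for one query cell."""
    n_hits = len(hit_labels)
    if n_hits < min_hits:
        # Too few hits: the query cell is rejected (label is an assumption).
        return "rejected", n_hits, None
    label, count = Counter(hit_labels).most_common(1)[0]
    fraction = count / n_hits
    # The threshold is not inclusive: the fraction must be strictly greater.
    if fraction > majority_threshold:
        return label, n_hits, fraction
    return "ambiguous", n_hits, fraction


print(majority_vote(["T cell", "T cell", "B cell"])[0])  # T cell
print(majority_vote(["T cell", "B cell"])[0])            # ambiguous
print(majority_vote(["T cell"])[0])                      # rejected
```

With return_evidence set to True, the number of hits and the majority fraction in the returned tuple correspond to the extra columns described above.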

blast2co(cl_dag, cl_field='cell_ontology_class', min_hits=2, thresh=0.5, min_path=4)

Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.

Parameters:
  • cl_dag (CellTypeDAG) – Cell ontology DAG

  • cl_field (str) – Specify the obs column containing cell ontology annotation in the reference dataset

  • min_hits (int) – Minimal number of hits required for annotating a query cell, otherwise the query cell will be rejected.

  • thresh (float) – Scoring threshold based on 1 - pvalue.

  • min_path (int) – Minimal allowed value of the maximal distance to root for a prediction to be made.

Returns:

Each row contains the inferred annotation for a query cell.

Return type:

prediction

filter(by='pval', cutoff=0.05, model_tolerance=0, n_jobs=1)

Filter hits by posterior distance or p-value

Parameters:
  • by (str) – Specifies a metric based on which to filter hits. Should be among {“dist”, “pval”}.

  • cutoff (float) – Cutoff when filtering hits.

  • model_tolerance (int) – Maximal number of models allowed in which the cutoff is not satisfied, above which the query cell will be rejected. Irrelevant for reconciled hits.

  • n_jobs (int) – Number of parallel jobs to run.

Returns:

Hit object containing remaining hits after filtering

Return type:

filtered_hits
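
The effect of model_tolerance on unreconciled hits can be sketched as follows. The sketch assumes the tolerance is applied per hit, i.e. a hit is kept when its p-value exceeds the cutoff in at most model_tolerance models; this is one reading of the parameter description, and the function name filter_hits and the toy values are hypothetical.

```python
def filter_hits(hit_indices, pvals, cutoff=0.05, model_tolerance=0):
    """Keep hits that fail the cutoff in at most `model_tolerance` models.

    hit_indices: reference cell indices of the hits for one query cell.
    pvals: one row per hit, one column per model (n_hits x n_models).
    """
    kept = []
    for idx, row in zip(hit_indices, pvals):
        n_failed = sum(p > cutoff for p in row)  # models where the hit fails
        if n_failed <= model_tolerance:
            kept.append(idx)
    return kept


hits = [12, 40, 7]
pvals = [
    [0.01, 0.02, 0.03],  # passes in all three models
    [0.01, 0.20, 0.02],  # fails in one model
    [0.30, 0.40, 0.01],  # fails in two models
]

print(filter_hits(hits, pvals))                     # [12]
print(filter_hits(hits, pvals, model_tolerance=1))  # [12, 40]
```

For reconciled hits there is only one value per hit, which is why model_tolerance is irrelevant in that case.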

gene_gradient(eval_point='query', normalize_deviation=True, avg_models=True, n_jobs='__UsE_gLoBaL__')

Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space. Useful for model interpretation.

Parameters:
  • eval_point (str) – At which point should the gradient be evaluated. Valid options include: {“query”, “ref”, “both”}

  • normalize_deviation (bool) – Whether to normalize query-hit deviation in the latent space.

  • avg_models (bool) – Whether to average gene-wise gradients across different models

  • n_jobs (int) – Number of parallel jobs to run when computing gradients. If not specified, config.N_JOBS will be used. Note that each (tensorflow) job could itself be distributed across multiple CPUs.

Returns:

A list with length equal to the number of query cells, where each element is a np.ndarray containing gene-wise gradients for every hit cell of a query cell. The arrays are of shape \(n\_hits \times n\_genes\) if avg_models is set to True, or \(n\_hits \times n\_models \times n\_genes\) if avg_models is set to False.

Return type:

gene_gradient

reconcile_models(dist_method='mean', pval_method='gmean')

Integrate model-specific distances and empirical p-values.

Parameters:
  • dist_method (str) – Specifies how to integrate distances across different models. Should be among {“mean”, “gmean”, “min”, “max”}.

  • pval_method (str) – Specifies how to integrate empirical p-values across different models. Should be among {“mean”, “gmean”, “min”, “max”}.

Returns:

Hit object containing reconciled hits

Return type:

reconciled_hits
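
What the four integration methods do to a per-query \(n\_hits \times n\_models\) matrix can be sketched with the standard library. The helper name reconcile and the toy matrices are hypothetical, and the sketch only illustrates collapsing the model axis; it is not the library's implementation.

```python
import statistics

# Map each documented method name to a function that combines one row,
# i.e. the values of a single hit across all models.
RECONCILERS = {
    "mean": statistics.fmean,
    "gmean": statistics.geometric_mean,
    "min": min,
    "max": max,
}


def reconcile(matrix, method):
    """Collapse an n_hits x n_models matrix into one value per hit."""
    combine = RECONCILERS[method]
    return [combine(row) for row in matrix]


dist = [[1.0, 3.0], [2.0, 2.0]]    # two hits, two models
pval = [[0.01, 0.04], [0.5, 0.5]]

print(reconcile(dist, "mean"))   # [2.0, 2.0]
print(reconcile(pval, "gmean"))  # approximately [0.02, 0.5]
```

The geometric mean works on a multiplicative scale, so it summarizes p-values that span several orders of magnitude differently from the arithmetic mean, which the largest value tends to dominate.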

to_data_frames()

Construct hit data frames for query cells. Note that only reconciled Hits objects are supported.

Returns:

Each element is a hit data frame for a query cell

Return type:

data_frame_dicts

Cell_BLAST.blast.amd(x, y, x_posterior, y_posterior, eps=0.1, x_is_pcasd=False, y_is_pcasd=False)

Expected array shapes:
  • x – latent_dim

  • y – latent_dim

  • x_posterior – n_posterior * latent_dim

  • y_posterior – n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.ed(x, y)

Expected array shapes:
  • x – latent_dim

  • y – latent_dim

Cell_BLAST.blast.md(x, y, x_posterior, y_posterior)

Expected array shapes:
  • x – latent_dim

  • y – latent_dim

  • x_posterior – n_posterior * latent_dim

  • y_posterior – n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.npd_v1(x, y, x_posterior, y_posterior, eps=0.0)

Expected array shapes:
  • x – latent_dim

  • y – latent_dim

  • x_posterior – n_posterior * latent_dim

  • y_posterior – n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.npd_v2(x, y, x_posterior, y_posterior, eps=0.1)

Expected array shapes:
  • x – latent_dim

  • y – latent_dim

  • x_posterior – n_posterior * latent_dim

  • y_posterior – n_posterior * latent_dim

Return type:

ndarray

Cell_BLAST.blast.sankey(query, ref, title='Sankey', width=500, height=500, tint_cutoff=1, font='Arial', font_size=10.0, suppress_plot=False)

Make a sankey diagram of query-reference mapping (only works in ipython notebooks).

Parameters:
  • query (ndarray) – 1-dimensional array of query annotation.

  • ref (ndarray) – 1-dimensional array of BLAST prediction based on reference database.

  • title (str) – Diagram title.

  • width (int) – Graph width.

  • height (int) – Graph height.

  • tint_cutoff (int) – Cutoff below which sankey flows are shown in a lighter (tinted) color.

  • font (str) – Font family used for the plot.

  • font_size (float) – Font size for the plot.

  • suppress_plot (bool) – Whether to suppress plotting and only return the figure dict.

Returns:

Figure object fed to iplot of the plotly module to produce the plot.

Return type:

fig
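
The data behind such a diagram is a tally of how many cells flow from each query annotation to each predicted annotation. The sketch below shows one way this tally could be computed from the two 1-dimensional inputs; the function name sankey_flows, the dictionary keys and the toy labels are hypothetical, and no plotting is performed.

```python
from collections import Counter


def sankey_flows(query, ref, tint_cutoff=1):
    """Count query -> prediction flows and flag those below `tint_cutoff`."""
    if len(query) != len(ref):
        raise ValueError("query and ref must have the same length")
    counts = Counter(zip(query, ref))  # (query label, predicted label) -> n
    return [
        {"source": q, "target": r, "value": n, "tinted": n < tint_cutoff}
        for (q, r), n in sorted(counts.items())
    ]


query_labels = ["B", "B", "T", "T", "T"]
predictions = ["B", "T", "T", "T", "rejected"]

for flow in sankey_flows(query_labels, predictions, tint_cutoff=2):
    print(flow)
```

In this toy input only the T to T flow carries two cells, so with tint_cutoff=2 every other flow is flagged as tinted.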