Module: Cell_BLAST.blast
Cell BLAST based on DIRECTi models
Classes:
BLAST | Cell BLAST
Hits | BLAST hits
Functions:
amd | x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
ed | x : latent_dim, y : latent_dim
md | x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
npd_v1 | x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
npd_v2 | x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
sankey | Make a sankey diagram of query-reference mapping (only works in ipython notebooks).
- class Cell_BLAST.blast.BLAST(models, ref, distance_metric='npd_v1', n_posterior=50, n_empirical=10000, cluster_empirical=False, eps=None, force_components=True, **kwargs)[source]
Cell BLAST
- Parameters:
ref (AnnData) – A reference dataset.
distance_metric (str) – Cell-to-cell distance metric to use. Must be one of {"npd_v1", "npd_v2", "md", "amd", "ed"}.
n_posterior (int) – Number of samples drawn from the posterior distribution when estimating posterior distance. Irrelevant for distance_metric="ed".
n_empirical (int) – Number of random cell pairs used to estimate the empirical distribution of cell-to-cell distance.
cluster_empirical (bool) – Whether to build a separate empirical distribution for each intrinsic cluster.
eps (Optional[float]) – A small number added to the normalization factors used in certain posterior-based distance metrics to improve numerical stability. If not specified, a recommended value is chosen according to the specified distance metric.
force_components (bool) – Whether to compute all necessary components upon initialization. If set to False, the necessary components are computed on the fly when performing queries.
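The role of n_empirical can be illustrated with a small sketch. It assumes that the empirical p-value of a hit is the fraction of randomly paired reference cells whose distance is no larger than the hit distance, and it uses plain Euclidean distance. Both are assumptions made for illustration, and `euclidean`, `build_empirical` and `empirical_pval` are hypothetical helpers, not part of Cell_BLAST.

```python
import bisect
import math
import random


def euclidean(x, y):
    """Plain Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_empirical(latent, n_empirical, seed=0):
    """Sorted distances of `n_empirical` random cell pairs (hypothetical helper)."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_empirical):
        i, j = rng.sample(range(len(latent)), 2)  # two distinct cells
        dists.append(euclidean(latent[i], latent[j]))
    return sorted(dists)


def empirical_pval(dist, empirical):
    """Fraction of random-pair distances that are <= `dist` (assumed definition)."""
    return bisect.bisect_right(empirical, dist) / len(empirical)


# Toy latent space: 200 cells, 3 latent dimensions
rng = random.Random(42)
latent = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
empirical = build_empirical(latent, n_empirical=1000)

small = empirical_pval(0.1, empirical)   # a very close hit
large = empirical_pval(10.0, empirical)  # a very distant "hit"
```

A larger n_empirical gives a finer-grained p-value, since the smallest non-zero value this lookup can return is 1 / n_empirical.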
Examples
A typical BLAST pipeline is described below.
Assuming we have a list of directi.DIRECTi models already fitted on some reference data, we can construct a BLAST object by passing the pretrained models and the reference data to the BLAST constructor.
>>> blast = BLAST(models, reference)
We can efficiently query the reference and obtain initial hits via the BLAST.query() method:
>>> hits = blast.query(query)
Then we pool information across multiple models and filter the initial hits using a more accurate metric (e.g. the empirical p-value based on posterior distance).
>>> hits = hits.reconcile_models().filter(by="pval", cutoff=0.05)
Finally, we use the Hits.annotate() method to obtain predictions based on reference annotations, e.g. "cell_ontology_class" in this case.
>>> annotation = hits.annotate("cell_ontology_class")
See the BLAST ipython notebook (Vignettes) for live examples.
Methods:
align(query[, n_jobs, random_seed, path]) – Align internal DIRECTi models with query datasets (fine tuning).
load(path[, mode]) – Load BLAST object from a directory.
query(query[, n_neighbors, store_dataset, ...]) – BLAST query
save(path[, only_used_genes]) – Save BLAST object to a directory.
- align(query, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__', path=None, **kwargs)[source]
Align internal DIRECTi models with query datasets (fine tuning).
- Parameters:
query (Union[AnnData, Mapping[str, AnnData]]) – A query dataset or a dict of query datasets, which will be aligned to the reference.
n_jobs (int) – Number of parallel jobs to run when building the BLAST index. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.
random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.
path (Optional[str]) – Path in which to store temporary files.
kwargs – Additional keyword parameters passed to directi.align_DIRECTi().
- Returns:
A new BLAST object with aligned internal models.
- Return type:
BLAST
- classmethod load(path, mode=1, **kwargs)[source]
Load BLAST object from a directory.
- Parameters:
- Returns:
Loaded BLAST object.
- Return type:
BLAST
- query(query, n_neighbors=5, store_dataset=False, n_jobs='__UsE_gLoBaL__', random_seed='__UsE_gLoBaL__')[source]
BLAST query
- Parameters:
query (AnnData) – Query transcriptomes.
n_neighbors (int) – Initial number of nearest neighbors to search in each model.
store_dataset (bool) – Whether to store the query dataset in the returned hit object. Note that this is necessary if Hits.gene_gradient() is to be used.
n_jobs (int) – Number of parallel jobs to run when performing query. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.
random_seed (int) – Random seed for posterior sampling. If not specified, config.RANDOM_SEED will be used.
- Returns:
Query hits
- Return type:
Hits
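As a rough illustration of what n_neighbors controls, the sketch below runs a brute-force nearest-neighbor search in each model's latent space and pools the resulting reference indices for one query cell. Pooling by set union and the use of Euclidean distance are assumptions made for the example, and `nearest` and `pooled_hits` are hypothetical helpers rather than Cell_BLAST functions.

```python
import math


def nearest(query_vec, ref_vecs, n_neighbors):
    """Indices of the `n_neighbors` reference vectors closest to `query_vec`."""
    order = sorted(range(len(ref_vecs)),
                   key=lambda i: math.dist(query_vec, ref_vecs[i]))
    return order[:n_neighbors]


def pooled_hits(query_by_model, ref_by_model, n_neighbors):
    """Union of per-model nearest neighbors for a single query cell (assumed pooling)."""
    hits = set()
    for q, ref in zip(query_by_model, ref_by_model):
        hits.update(nearest(q, ref, n_neighbors))
    return sorted(hits)


# Two toy "models", each embedding the same 4 reference cells in 2 dimensions
ref_by_model = [
    [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]],
    [[0.0, 0.0], [4.0, 4.0], [0.5, 0.0], [9.0, 9.0]],
]
# The same query cell as embedded by each model
query_by_model = [[0.1, 0.0], [0.2, 0.0]]

hits = pooled_hits(query_by_model, ref_by_model, n_neighbors=2)
```

Because the models disagree on the second-nearest neighbor here, the pooled hit list contains more than n_neighbors entries.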
- class Cell_BLAST.blast.Hits(blast, hits, dist, pval, query)[source]
BLAST hits
- Parameters:
hits (List[ndarray]) – Indices of hit cells in the reference dataset. Each list element contains the hit cell indices for one query cell.
dist (List[ndarray]) – Hit cell distances. Each list element corresponds to one query cell and is an \(n\_hits \times n\_models\) matrix, whose entries are the distances to each hit cell under each model.
pval (List[ndarray]) – Hit cell empirical p-values. Each list element corresponds to one query cell and is an \(n\_hits \times n\_models\) matrix, whose entries are the empirical p-values of each hit cell under each model.
query (AnnData) – Query dataset.
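To make the layout concrete, the sketch below stores dist for two query cells as nested lists (one n_hits × n_models matrix per query cell) and collapses the model axis with an aggregation function. Using the mean or the maximum is an assumption for illustration; the options actually accepted by Hits.reconcile_models() are not listed in this section, and `reconcile` is a hypothetical helper.

```python
from statistics import mean


def reconcile(per_query, agg=mean):
    """Collapse the model axis of a list of n_hits x n_models matrices."""
    return [[agg(row) for row in matrix] for matrix in per_query]


# Two query cells, two models
dist = [
    [[0.25, 0.75], [1.0, 3.0], [0.5, 0.5]],  # query 0: 3 hits x 2 models
    [[2.0, 4.0]],                            # query 1: 1 hit  x 2 models
]

reconciled_mean = reconcile(dist)          # mean across models
reconciled_max = reconcile(dist, agg=max)  # most pessimistic model
```

Note that query cells may have different numbers of hits, which is why the data are kept as a list of matrices rather than one rectangular array.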
Methods:
annotate(field[, min_hits, ...]) – Annotate query cells based on existing annotations of hit cells via majority voting.
blast2co(cl_dag[, cl_field, min_hits, ...]) – Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.
filter([by, cutoff, model_tolerance, n_jobs]) – Filter hits by posterior distance or p-value
gene_gradient([eval_point, ...]) – Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space.
reconcile_models([dist_method, pval_method]) – Integrate model-specific distances and empirical p-values.
Construct hit data frames for query cells.
- annotate(field, min_hits=2, majority_threshold=0.5, return_evidence=False)[source]
Annotate query cells based on existing annotations of hit cells via majority voting.
- Parameters:
field (str) – Specifies a meta column in anndata.AnnData.obs.
min_hits (int) – Minimal number of hits required for annotating a query cell; query cells with fewer hits are rejected.
majority_threshold (float) – Minimal majority fraction (not inclusive) required for confident annotation. Only effective when predicting categorical variables. If the threshold is not met, the annotation will be "ambiguous".
return_evidence (bool) – Whether to return the evidence level of the annotations.
- Returns:
Each row contains the inferred annotation for a query cell. If return_evidence is set to False, the data frame contains only one column, i.e. the inferred annotation. If return_evidence is set to True, the data frame also contains the number of hits, as well as the majority fraction (only for categorical annotations) for each query cell.
- Return type:
pandas.DataFrame
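The voting rule described above can be sketched for a single query cell as follows. This is a minimal illustration for categorical annotations only; `majority_vote` is a hypothetical helper, and the literal "rejected" label is an assumption ("ambiguous" is the label named in the parameter description).

```python
from collections import Counter


def majority_vote(hit_labels, min_hits=2, majority_threshold=0.5):
    """Annotate one query cell from the labels of its hits (hypothetical helper)."""
    if len(hit_labels) < min_hits:
        return "rejected"  # assumed label for cells with too few hits
    label, count = Counter(hit_labels).most_common(1)[0]
    # The threshold is not inclusive: the majority fraction must exceed it
    if count / len(hit_labels) > majority_threshold:
        return label
    return "ambiguous"


predictions = [
    majority_vote(["T cell", "T cell", "B cell"]),  # 2/3 exceeds 0.5
    majority_vote(["T cell", "B cell"]),            # 1/2 does not exceed 0.5
    majority_vote(["T cell"]),                      # fewer than min_hits
]
```

The second case shows why the non-inclusive threshold matters: an exact 50/50 split is not treated as a confident majority.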
- blast2co(cl_dag, cl_field='cell_ontology_class', min_hits=2, thresh=0.5, min_path=4)[source]
Annotate query cells based on existing annotations of hit cells via the cell-ontology-aware BLAST2CO method.
- Parameters:
cl_dag (CellTypeDAG) – Cell ontology DAG.
cl_field (str) – The obs column containing cell ontology annotation in the reference dataset.
min_hits (int) – Minimal number of hits required for annotating a query cell; query cells with fewer hits are rejected.
thresh (float) – Scoring threshold based on 1 - pvalue.
min_path (int) – Minimal allowed value of the maximal distance to root for a prediction to be made.
- Returns:
Each row contains the inferred annotation for a query cell.
- Return type:
prediction
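The min_path criterion refers to the maximal distance from an ontology term to the root. A minimal sketch of that quantity on a toy DAG follows; the `parents` dictionary and the `max_depth` and `specific_enough` helpers are hypothetical stand-ins, since the interface of CellTypeDAG is not described here.

```python
from functools import lru_cache

# Toy ontology: each term maps to its direct parents (the root has none)
parents = {
    "cell": [],
    "blood cell": ["cell"],
    "immune cell": ["cell"],
    "leukocyte": ["blood cell", "immune cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
}


@lru_cache(maxsize=None)
def max_depth(term):
    """Length of the longest path from `term` up to a root."""
    if not parents[term]:
        return 0
    return 1 + max(max_depth(p) for p in parents[term])


def specific_enough(term, min_path=4):
    """True if the longest path from `term` to the root has at least `min_path` steps."""
    return max_depth(term) >= min_path
```

Under this reading, a larger min_path restricts predictions to more specific terms deeper in the ontology.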
- filter(by='pval', cutoff=0.05, model_tolerance=0, n_jobs=1)[source]
Filter hits by posterior distance or p-value
- Parameters:
by (str) – Metric by which to filter hits. Must be one of {"dist", "pval"}.
cutoff (float) – Cutoff when filtering hits.
model_tolerance (int) – Maximal number of models allowed in which the cutoff is not satisfied, above which the query cell will be rejected. Irrelevant for reconciled hits.
n_jobs (int) – Number of parallel jobs to run.
- Returns:
Hit object containing remaining hits after filtering
- Return type:
Hits
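One possible reading of model_tolerance is sketched below for a single query cell: a hit is kept when the number of models in which its p-value exceeds the cutoff is at most model_tolerance. Applying the rule per hit, treating "value greater than cutoff" as a failure, and the `filter_hits` helper itself are assumptions for illustration, not the library's implementation.

```python
def filter_hits(hits, pval, cutoff=0.05, model_tolerance=0):
    """Keep hits whose p-value exceeds `cutoff` in at most `model_tolerance` models.

    hits: reference indices for one query cell
    pval: n_hits x n_models nested list of empirical p-values
    """
    kept = []
    for idx, row in zip(hits, pval):
        n_failed = sum(p > cutoff for p in row)  # models where the cutoff is not met
        if n_failed <= model_tolerance:
            kept.append(idx)
    return kept


hits = [12, 40, 7]
pval = [
    [0.01, 0.02, 0.03],  # passes in all three models
    [0.01, 0.20, 0.02],  # fails in one model
    [0.30, 0.40, 0.01],  # fails in two models
]

strict = filter_hits(hits, pval)                      # model_tolerance=0
lenient = filter_hits(hits, pval, model_tolerance=1)  # allow one failing model
```

After reconcile_models() there is only one value per hit, which is consistent with model_tolerance being irrelevant for reconciled hits.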
- gene_gradient(eval_point='query', normalize_deviation=True, avg_models=True, n_jobs='__UsE_gLoBaL__')[source]
Compute gene-wise gradient for each pair of query-hit cells based on query-hit deviation in the latent space. Useful for model interpretation.
- Parameters:
eval_point (str) – Point at which the gradient is evaluated. Valid options are {"query", "ref", "both"}.
normalize_deviation (bool) – Whether to normalize query-hit deviation in the latent space.
avg_models (bool) – Whether to average gene-wise gradients across different models.
n_jobs (int) – Number of parallel jobs to run when performing query. If not specified, config.N_JOBS will be used. Note that a single (tensorflow) job may itself be distributed across multiple CPUs.
- Returns:
A list with length equal to the number of query cells, where each element is an np.ndarray containing the gene-wise gradient for every hit cell of a query cell. The arrays are of shape \(n\_hits \times n\_genes\) if avg_models is set to True, or \(n\_hits \times n\_models \times n\_genes\) if avg_models is set to False.
- Return type:
List[np.ndarray]
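The relationship between the two output shapes can be shown with nested lists standing in for arrays. The sketch assumes that avg_models=True corresponds to a plain arithmetic mean over the model axis; `average_models` is a hypothetical helper, and the numbers are made up.

```python
def average_models(grad):
    """Average an n_hits x n_models x n_genes nested list over the model axis."""
    return [
        [sum(vals) / len(vals) for vals in zip(*per_model)]
        for per_model in grad
    ]


# One query cell with 2 hits, 2 models and 3 genes
grad = [
    [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]],  # hit 0: model 0, model 1
    [[0.0, 4.0, 1.0], [2.0, 0.0, 1.0]],  # hit 1: model 0, model 1
]

averaged = average_models(grad)  # shape: n_hits x n_genes
```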
- Cell_BLAST.blast.amd(x, y, x_posterior, y_posterior, eps=0.1, x_is_pcasd=False, y_is_pcasd=False)[source]
x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
- Return type:
- Cell_BLAST.blast.md(x, y, x_posterior, y_posterior)[source]
x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
- Return type:
- Cell_BLAST.blast.npd_v1(x, y, x_posterior, y_posterior, eps=0.0)[source]
x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
- Return type:
- Cell_BLAST.blast.npd_v2(x, y, x_posterior, y_posterior, eps=0.1)[source]
x : latent_dim, y : latent_dim, x_posterior : n_posterior * latent_dim, y_posterior : n_posterior * latent_dim
- Return type:
- Cell_BLAST.blast.sankey(query, ref, title='Sankey', width=500, height=500, tint_cutoff=1, font='Arial', font_size=10.0, suppress_plot=False)[source]
Make a sankey diagram of query-reference mapping (only works in ipython notebooks).
- Parameters:
query (ndarray) – 1-dimensional array of query annotation.
ref (ndarray) – 1-dimensional array of BLAST prediction based on reference database.
title (str) – Diagram title.
width (int) – Graph width.
height (int) – Graph height.
tint_cutoff (int) – Cutoff below which sankey flows are shown in a lighter (tinted) color.
font (str) – Font family used for the plot.
font_size (float) – Font size for the plot.
suppress_plot (bool) – Whether to suppress plotting and only return the figure dict.
- Returns:
Figure object fed to iplot of the plotly module to produce the plot.
- Return type:
dict
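To show what such a figure dict can look like, the sketch below counts how many cells flow from each query annotation to each predicted annotation and arranges the counts in the node/link layout used by plotly sankey traces. It is an illustration only: `sankey_figure` is a hypothetical helper, it omits tinting and font options, and it does not reproduce the output of Cell_BLAST.blast.sankey.

```python
from collections import Counter


def sankey_figure(query, ref, title="Sankey", width=500, height=500):
    """Build a plotly-style figure dict from paired annotations (hypothetical helper)."""
    query_labels = sorted(set(query))
    ref_labels = sorted(set(ref))
    # Query nodes come first, reference nodes are numbered after them
    q_index = {label: i for i, label in enumerate(query_labels)}
    r_index = {label: len(query_labels) + i for i, label in enumerate(ref_labels)}

    flows = Counter(zip(query, ref))  # (query label, predicted label) -> cell count
    pairs = sorted(flows)
    return {
        "data": [{
            "type": "sankey",
            "node": {"label": query_labels + ref_labels},
            "link": {
                "source": [q_index[q] for q, _ in pairs],
                "target": [r_index[r] for _, r in pairs],
                "value": [flows[p] for p in pairs],
            },
        }],
        "layout": {"title": title, "width": width, "height": height},
    }


query = ["T cell", "T cell", "B cell", "B cell", "B cell"]
ref = ["T cell", "T cell", "B cell", "B cell", "rejected"]
fig = sankey_figure(query, ref)
```

Each link value is the number of cells sharing a (query annotation, prediction) pair, so thin flows correspond to rare mappings such as the single rejected B cell above.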