API Reference
SummarizedPy
A class to hold bulk proteomics (and metabolomics) data and process it for differential expression analysis.
Parameters:
Attributes:
Raises:
Examples:
Constructing SummarizedPy object from numpy array and pandas DataFrame.
>>> import pandas as pd
>>> import numpy as np
>>> import depy as dp
>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6],
...                  [7, 8, 9]])
>>> features = pd.DataFrame({"proteinID": ["feature1", "feature2", "feature3"]})
>>> samples = pd.DataFrame({"sample": ["sample1", "sample2", "sample3"]})
>>> sp = dp.SummarizedPy(data=data, features=features, samples=samples)
>>> sp
<SummarizedPy(data=ndarray(shape=(3, 3), dtype=int64), features=DataFrame(shape=(3, 1)), samples=DataFrame(shape=(3, 1)))>
filter_features(expr=None, mask=None)
Filter SummarizedPy object based on feature metadata, using either Pandas-like query strings or a mask.
Parameters:
Returns:
Raises:
Examples:
Filter out reverse hits in example dataset PXD000438.
>>> import depy as dp
>>> import re
>>> sp = dp.SummarizedPy().load_example_data()
>>> rev_hits = sp.features["protein_id"].apply(lambda x: bool(re.match("REV", x)))
>>> sp.features["rev"] = rev_hits
>>> sp = sp.filter_features(expr="~rev")
>>> sp = sp.filter_features(mask=~rev_hits)  # Alternatively, use rev_hits as a boolean mask
filter_missingness(frac=0.75, strategy='all_conditions', condition_column=None)
Filter the SummarizedPy object by per-feature missingness, evaluated in one of three ways: overall, across all conditions, or within any condition.
Parameters:
Returns:
Raises:
Examples:
Filter out missing values in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall", frac=0.75)
>>> sp = sp.filter_missingness(strategy="any_condition", condition_column="condition", frac=0.75)
>>> sp = sp.filter_missingness(strategy="all_conditions", condition_column="condition", frac=0.75)
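For intuition, the fraction-based filtering can be sketched in plain NumPy. This is an illustrative sketch of the "overall" strategy only, not depy's internals:

```python
import numpy as np

# Keep features (rows) whose fraction of observed (non-NaN) values meets `frac`.
data = np.array([
    [1.0, np.nan, 3.0, 4.0],    # 3/4 observed -> kept at frac=0.75
    [np.nan, np.nan, 1.0, 2.0], # 2/4 observed -> dropped at frac=0.75
])
frac = 0.75
observed = ~np.isnan(data)            # boolean mask of present values
keep = observed.mean(axis=1) >= frac  # feature-wise observed fraction
filtered = data[keep]                 # keep == [True, False]
```

The condition-aware strategies apply the same computation within sample groups defined by condition_column.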
filter_samples(expr=None, mask=None)
Filter SummarizedPy object based on sample metadata, using either Pandas-like query strings or a mask.
Parameters:
Returns:
Raises:
Examples:
Filter for ADC samples in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp.samples["condition"] = ["ADC"] * 6 + ["SCC"] * 6
>>> sp = sp.filter_samples(expr="condition=='ADC'")
import_from_delim_file(path, delim, data_selector=None, feature_selector=None, replace_val_with_nan=None, clean_column_names=False)
classmethod
Alternative constructor from file. Reads data directly from a delimited file, including feature data, feature metadata, and sample metadata, and assigns them to the data, features, and samples attributes automatically. This is intended for convenient import of standardized outputs such as MaxQuant's proteingroups.txt or FragPipe/DIA-NN's diann-output.pg_matrix.tsv.
The method uses ColumnSelector objects to assign columns to their relevant storage containers. If no ColumnSelector is provided, all numerical (float64) columns are assigned to data and all string (object) columns to features; it is therefore best to state explicitly which columns to import. The samples attribute is automatically populated with the column names from data.
The original data row and column indices are stored in the 'orig_index' variables of features and samples, respectively, for bookkeeping. The path and delimiter used are appended to the history attribute. Values in data can be replaced with NaN to indicate missingness (e.g. intensity values of 0), and column names can be automatically cleaned.
Parameters:
Returns:
Examples:
Read data from a protein groups file (e.g. MaxQuant or FragPipe) and construct a SummarizedPy object. By default, all numerical columns are placed in 'data', their associated column names in 'samples', and object- or string-type columns in 'features'.
>>> import depy as dp
>>> pg_path = "~/path/to/my/proteingroups.txt"
>>> sp = dp.SummarizedPy().import_from_delim_file(path=pg_path, delim='\t', replace_val_with_nan=0., clean_column_names=True)
Select columns to import using ColumnSelector object. Assume data are in columns containing sub-string 'LFQ_intensity_'.
>>> import depy as dp
>>> pg_path = "~/path/to/my/proteingroups.txt"
>>> data = dp.ColumnSelector(regex="LFQ_intensity_")
>>> features = dp.ColumnSelector(names=["proteinID", "geneSymbol", "proteinDescription"])
>>> sp = dp.SummarizedPy().import_from_delim_file(path=pg_path, delim='\t', data_selector=data, feature_selector=features)
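The documented dtype-based default (float64 columns to data, object columns to features) can be illustrated with pandas. The column names below are hypothetical:

```python
import pandas as pd

# Sketch of the documented default split, not depy's implementation.
pg = pd.DataFrame({
    "proteinID": ["P1", "P2"],          # object dtype -> features
    "LFQ_intensity_s1": [1.5, 2.5],     # float64 dtype -> data
    "LFQ_intensity_s2": [3.0, 4.0],
})
data_cols = pg.select_dtypes(include="float64")
feature_cols = pg.select_dtypes(include="object")
# data_cols.columns == ['LFQ_intensity_s1', 'LFQ_intensity_s2']
# feature_cols.columns == ['proteinID']
```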
impute_missing_values(method=None, extra_args=None)
Impute missing values using the ImputeLCMD R package.
Several common methods are available under the following assumptions:
- MAR (KNN, SVD, MLE)
- MNAR (QRILC, MinDet, MinProb)
- Both MAR and MNAR (Hybrid)
Refer to the ImputeLCMD package documentation for further information: https://cran.r-project.org/web/packages/imputeLCMD/imputeLCMD.pdf
Parameters:
Returns:
Raises:
Examples:
Impute missing values using ImputeLCMD's hybrid strategy. Use example dataset PXD000438 after filtering excessive missingness.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall")
>>> sp = sp.impute_missing_values(method="Hybrid")
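For intuition, a minimal MinDet-style (left-censored) imputation can be sketched in NumPy. This is a conceptual stand-in, not the imputeLCMD implementation:

```python
import numpy as np

# Replace each sample's missing values with a low quantile (here the 1st
# percentile) of that sample's observed intensities, mimicking the idea
# that MNAR values sit near the detection limit.
data = np.array([
    [5.0, np.nan],
    [6.0, 7.0],
    [np.nan, 8.0],
])
q = 0.01
imputed = data.copy()
for j in range(data.shape[1]):        # iterate over samples (columns)
    col = data[:, j]
    low = np.nanquantile(col, q)      # small value from observed data
    imputed[np.isnan(col), j] = low
```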
limma_trend_dea(design_formula=None, contrasts=None, feature_id_col=None, robust=False, block=None, array_weights=False, extra_args=None)
Run differential expression analysis (DEA) with limma-trend. Limma powers its analyses by incorporating an empirical mean-variance trend, estimated from the data, as a prior. This alleviates the issue of estimating fold changes in the face of heteroscedasticity: low-abundance features are prone to false positives due to their inherently lower variance, whereas high-abundance features are prone to false negatives. By modeling the overall mean-variance trend and incorporating it as a prior, information is borrowed across features (which powers low-N designs) and variance estimates are effectively regularized. Compared to traditional parametric statistics, this empirical Bayes approach has consistently been found to be more powerful and to achieve better FDR (false discovery rate) control. Additionally, a robust approximation can be used if the data contain hypo- or hypervariable features, to avoid skewing the mean-variance trend.
By building on linear models, limma can accommodate complex designs, including fixed and random factors (i.e. mixed effects, such as nested factors or repeated measures) and their combination (i.e. to model between- and within-subject designs).
Limma can also incorporate sample quality weights, which are especially powerful in noisy datasets, as is often the case with human or animal samples. Using limma's arrayWeights function, samples are up- or down-weighted based on how variable they are compared to the average sample. Importantly, this function takes the experimental design into account when estimating the sample weights. The user can also provide an arbitrary design, or none at all, to estimate averaged weights for groups of samples (e.g. where sample quality is known to be especially poor for some condition or technical covariate) or simply to estimate sample-specific weights independent of design covariates.
arrayWeights is run with the 'REML' method, which allows for missing values. The weights are inversely proportional to sample variability (i.e. a sample with a weight of 0.5 is twice as variable as the average sample; weights >1 indicate less-variable samples and tend to reflect higher quality). The weights can be stabilized further and squeezed towards 1 by increasing the 'prior_n' parameter above its default of 10; this tends to make weights more symmetric around 1 (average/equality), thus up- and down-weighting samples by similar magnitudes rather than disproportionately up-weighting good samples.
Parameters:
Returns:
Raises:
Examples:
Full DEA pipeline on example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp.samples["condition"] = ["ADC"] * 6 + ["SCC"] * 6 # Add condition variable
>>> sp = sp.filter_missingness(strategy="overall") # Pre-process
>>> sp = sp.transform_features(method="log", by=2)
>>> sp = sp.impute_missing_values(method="Hybrid")
>>> sp = sp.surrogate_variable_analysis(mod="~condition")
>>> des = "~condition+sv_1+sv_2+sv_3" # design formula (incl. 'condition' and surrogate variables)
>>> contr = {"SCCvsADC": "SCC-ADC"} # define contrast (levels must be present in covariates above)
>>> sp = sp.limma_trend_dea(design_formula=des, contrasts=contr, array_weights=True) # with array_weights option
>>> sp.results # Check newly created results attribute
load_example_data()
classmethod
Load a real-world example proteomics dataset for demonstration purposes. The function loads dataset 'PXD000438' from the ImputeLCMD package. The data were generated from a super-SILAC experiment of human adenocarcinoma (ADC) and squamous cell carcinoma (SCC) samples. The dataset contains six ADC and six SCC samples and 3,709 proteomic features with raw feature intensities and missing values. Samples 092.1-3 and 441.1-3 are ADC; 561.1-3 and 691.1-3 are SCC.
For more information about the dataset: https://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD000438
Returns:
Examples:
Load example dataset.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
<SummarizedPy(data=ndarray(shape=(3709, 12), dtype=float64), features=DataFrame(shape=(3709, 1)), samples=DataFrame(shape=(12, 1)))>
load_sp(path=None)
classmethod
Load a previously saved SummarizedPy object, stored as a pickle file on disk.
Parameters:
Returns:
Examples:
Load a saved SP object from pickle file.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_sp("my_sp.pkl")
plot_pca(standardize=True, n_comp=None, fill_by=None, label=False)
Generate PCA plot of data using the first two principal components. PCA is computed using scikit-learn's PCA estimator with defaults.
If the data contain features with missing values, those features are omitted, as PCA requires complete data.
Parameters:
Returns:
Raises:
Examples:
Generate PCA plot on standardized data.
>>> sp.plot_pca(standardize=True)
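The documented handling of missing values (incomplete features are dropped before PCA) can be sketched with a plain SVD in place of scikit-learn's estimator. This is illustrative only, not depy's implementation:

```python
import numpy as np

# Drop features with any missing value, then compute the first two PCs
# of the samples via SVD on the centered samples-by-features matrix.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))   # features x samples
data[0, 0] = np.nan               # one feature has a missing value
complete = data[~np.isnan(data).any(axis=1)]  # omit incomplete features
X = complete.T                    # samples x features
Xc = X - X.mean(axis=0)           # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]         # sample coordinates on PC1/PC2, shape (6, 2)
```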
save_sp(path=None)
Save SummarizedPy object to disk using pickle for easy loading in the future.
The method automatically appends '.pkl' to the file name.
Parameters:
Examples:
Save SP object to disk.
>>> sp.save_sp(path="my_sp")
select_variable_features(top_n=None, top_percentile=None, plot=False)
Select highly variable features (HVF) based on deviation from the data's mean-variance trend.
Uses LOWESS to fit a smooth trend to the feature-wise mean and standard deviation values.
Note: if a log2 transformation has not already been applied via transform_features(method='log', by=2), it will be applied prior to fitting the mean-variance trend; data are returned on the original scale.
Parameters:
Returns:
Raises:
Examples:
Select top 500 most variable features in example dataset PXD000438 and plot the fitted mean-variance trend.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.select_variable_features(top_n=500, plot=True)
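The deviation-from-trend idea can be sketched in NumPy, using a polynomial fit as a simple stand-in for the LOWESS trend described above. This is a conceptual sketch, not depy's implementation:

```python
import numpy as np

# Fit a mean-sd trend on log2 data and rank features by how far their
# standard deviation sits above the trend.
rng = np.random.default_rng(2)
data = rng.lognormal(mean=8, sigma=1, size=(1000, 12))  # features x samples
logged = np.log2(data)
mean = logged.mean(axis=1)
sd = logged.std(axis=1)
trend = np.polyval(np.polyfit(mean, sd, deg=2), mean)   # fitted mean-sd trend
residual = sd - trend                                   # deviation from trend
top_n = 100
hvf_idx = np.argsort(residual)[-top_n:]                 # most variable features
```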
surrogate_variable_analysis(mod=None, mod0='~1', num_sv=None)
Run surrogate variable analysis (SVA) to estimate latent factors that capture expression heterogeneity or hidden batch effects. The surrogate variables (SVs) are added to the samples attribute and can be included as covariates in DEA.
SVs are estimated through PCA on the residualized feature matrix after regressing out known experimental and technical/batch covariates. This is done by supplying the method with a fully parameterized model (mod), including all known covariates (experimental and technical, i.e. as present in the samples attribute), and a null model including only technical (adjustment) covariates (mod0). The number of surrogate variables to estimate can be specified with the num_sv argument; alternatively, the method can be run without it, letting SVA estimate the number empirically (using SVA's 'num.sv' function with the default 'leek' method). Note that this can return 0 SVs and fail; it is still possible to find significant SVs by forcing the method to run with a pre-specified num_sv.
The mod and mod0 arguments must use R formula syntax, which starts with a tilde (~) and adds covariates with + and their interactions with *. Covariates must be present in the samples attribute. If no technical covariates are known, the method runs with the recommended default of mod0='~1' (i.e. an intercept-only model). For more information, see: https://bioconductor.org/packages/3.19/bioc/vignettes/sva/inst/doc/sva.pdf
Parameters:
Returns:
Raises:
Examples:
Use SVA to estimate surrogate variables for inclusion in DEA. Use example dataset PXD000438: filter missing values and log2 transform features first.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.filter_missingness(strategy="overall") # Filter excessive missingness (this is important)
>>> sp = sp.transform_features(method="log", by=2) # Log transform data (important)
>>> sp = sp.impute_missing_values(method="Hybrid") # Optionally, impute remaining missing values (sva excludes any feature with NaN values)
>>> sp = sp.surrogate_variable_analysis(mod="~condition") # Default null model: mod0 = '~1' (intercept-only)
>>> sp.samples # SVs now in samples attribute
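The residualize-then-PCA idea behind SVA can be sketched in NumPy. This is a rough conceptual sketch under simplifying assumptions, not the sva package's algorithm:

```python
import numpy as np

# Regress out known covariates, then take principal components of the
# residuals as surrogate-variable candidates.
rng = np.random.default_rng(1)
n_samples, n_features = 12, 100
condition = np.repeat([0.0, 1.0], 6)
X = np.column_stack([np.ones(n_samples), condition])  # design: ~condition
Y = rng.normal(size=(n_samples, n_features))          # samples x features
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)          # fit known model
resid = Y - X @ beta                                  # remove known effects
U, S, Vt = np.linalg.svd(resid - resid.mean(axis=0), full_matrices=False)
sv_1 = U[:, 0] * S[0]                                 # first surrogate variable
```

In the real method, such components would be appended to the samples attribute for use in the design formula.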
transform_features(method, axis=None, by=None)
transform_features(
    method: Literal["center"],
    axis: int,
    by: Optional[Literal["mean", "median"]] = None,
) -> SummarizedPy
transform_features(
    method: Literal["log"],
    by: Optional[int] = None,
) -> SummarizedPy
transform_features(
    method: Literal["z-score"],
    axis: int,
    by: None = None,
) -> SummarizedPy
transform_features(
    method: Literal["vsn"],
) -> SummarizedPy
Mathematically transform features stored in the data attribute using one of: log (base N), center (mean or median subtraction), z-score (standardize), or vsn (variance-stabilizing normalization).
Parameters:
Returns:
Raises:
Examples:
Transform feature data in example dataset PXD000438.
>>> import depy as dp
>>> sp = dp.SummarizedPy().load_example_data()
>>> sp = sp.transform_features(method="log", by=2) # Log transformation (base 2)
>>> sp = sp.transform_features(method="center", by="median", axis=1) # Center data sample-wise by median
>>> sp = sp.transform_features(method="z-score", axis=0) # Feature-wise standardization
>>> sp = sp.transform_features(method="vsn") # vsn normalization
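The first three transforms have simple NumPy equivalents. The axis conventions below are assumptions for illustration, not necessarily depy's:

```python
import numpy as np

data = np.array([[1.0, 2.0], [4.0, 8.0]])  # features x samples
logged = np.log2(data)                                    # method="log", by=2
centered = data - np.median(data, axis=0, keepdims=True)  # center each sample (column) by its median
z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)  # feature-wise z-score
# logged[1, 1] == 3.0 (log2 of 8)
```

vsn has no one-line equivalent; it is a model-based normalization provided by the R vsn package.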
volcano_plot(contrasts=None, top_n=3, de_colors=None)
Generate volcano plots for limma-trend results, highlighting the top N up- and downregulated features.
Parameters:
Returns:
Raises:
Examples:
Generate volcano plots after running limma_trend_dea method.
>>> sp.volcano_plot()
>>> sp.volcano_plot(contrasts=["SCCvsADC"])
ColumnSelector
Object for pre-selecting columns when constructing SummarizedPy from file.
Parameters:
Examples:
Define columns to import from file. Assume data are in columns labeled "LFQ_intensity_*".
>>> import depy as dp
>>> data = dp.ColumnSelector(regex="LFQ_intensity_")
>>> features = dp.ColumnSelector(names=["proteinID", "geneSymbol", "proteinDescription"])
>>> sp = dp.SummarizedPy().import_from_delim_file(path="my/path/proteingroups.txt", delim='\t', data_selector=data, feature_selector=features)
select_cols(df)
Select columns from a DataFrame according to the selector's criteria.
Parameters:
Returns:
Raises:
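Assuming select_cols filters a DataFrame by the selector's configured names or regex (the actual depy behavior may differ), the idea can be sketched with pandas:

```python
import pandas as pd

# Hypothetical sketch of ColumnSelector-style column selection.
df = pd.DataFrame({
    "proteinID": ["P1"],
    "LFQ_intensity_s1": [1.0],
    "LFQ_intensity_s2": [2.0],
})
by_regex = df.filter(regex="LFQ_intensity_")  # regex-based selection
by_names = df[["proteinID"]]                  # explicit name-based selection
# by_regex.columns == ['LFQ_intensity_s1', 'LFQ_intensity_s2']
```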