rcx_tk.msdial

Functions

process_msdial_file(→ None)

Process MSDial output file to group duplicate alignments.

get_n_samples(→ int)

Obtain the number of samples from an MSDial file.

process_msdial(→ pandas.DataFrame)

Process a DataFrame of MSDial results to group duplicate alignments.

refine(→ list[pandas.Index])

Refine clusters using an m/z tolerance, splitting a cluster when the quant masses of its members differ.

aggregations(→ dict[str, collections.abc.Callable])

Generate aggregation functions based on column types.

find_clusters(→ list[pandas.Index])

Transitively merge all duplicate indices into groups; two groups are merged if they share any index.

union(→ pandas.Index)

Combine a list of indices into a single union index.

find_all_duplicates(→ list[pandas.Index])

Get index of any duplicate values in any column.

Module Contents

rcx_tk.msdial.process_msdial_file(file_path: str, out_path: str, mz_tol_ppm: int) → None

Process MSDial output file to group duplicate alignments.

Parameters:
  • file_path (str) – Input file path.

  • out_path (str) – Output file path.

  • mz_tol_ppm (int) – m/z tolerance in ppm to use for splitting clustered alignments.

rcx_tk.msdial.get_n_samples(file_path: str) → int

Obtain the number of samples from an MSDial file.

Parameters:

file_path (str) – Path to the MSDial file.

Returns:

Number of samples contained in the file.

Return type:

int

rcx_tk.msdial.process_msdial(df: pandas.DataFrame, n_samples: int, mz_tol_ppm: int, metadata_cols: int = 27, index_col: str = 'Alignment ID') → pandas.DataFrame

Process a DataFrame of MSDial results to group duplicate alignments.

Parameters:
  • df (pd.DataFrame) – DataFrame with MSDial results.

  • n_samples (int) – Number of samples; required to determine the number of intensity columns in df.

  • mz_tol_ppm (int) – m/z tolerance in ppm to use for splitting clustered alignments.

  • metadata_cols (int, optional) – Number of columns containing data prior to feature abundances. Defaults to 27.

  • index_col (str, optional) – Column to denote the index. Defaults to “Alignment ID”.

Returns:

DataFrame with clustered alignment IDs.

Return type:

pd.DataFrame
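
How `metadata_cols` and `n_samples` might partition the table can be sketched as follows. This is a minimal illustration, not the library's implementation: `split_metadata_and_abundances` is a hypothetical helper, the column names are invented for the example, and it assumes that the intensity columns directly follow the metadata columns.

```python
import pandas as pd


def split_metadata_and_abundances(df, n_samples, metadata_cols=27):
    """Hypothetical helper: slice a wide table into metadata and intensity blocks."""
    # Assumption: the first `metadata_cols` columns are metadata ...
    metadata = df.iloc[:, :metadata_cols]
    # ... and the next `n_samples` columns hold one intensity column per sample.
    abundances = df.iloc[:, metadata_cols:metadata_cols + n_samples]
    return metadata, abundances


# Toy table with 2 metadata columns, 2 sample columns and 1 trailing summary column.
df = pd.DataFrame(
    {
        "Alignment ID": [0, 1, 2],
        "Quant mass": [100.0, 100.0001, 250.5],
        "sample_A": [10.0, 10.0, 3.0],
        "sample_B": [20.0, 20.0, 4.0],
        "Average": [15.0, 15.0, 3.5],
    }
)

metadata, abundances = split_metadata_and_abundances(df, n_samples=2, metadata_cols=2)
print(list(metadata.columns))
print(list(abundances.columns))
```

Columns after the intensity block (such as the trailing summary column in the toy table) fall outside both slices in this sketch.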

rcx_tk.msdial.refine(clusters: list[pandas.Index], metadata: pandas.DataFrame, mz_tol_ppm: int) → list[pandas.Index]

Refine clusters using an m/z tolerance, splitting a cluster when the quant masses of its members differ.

Parameters:
  • clusters (list[pd.Index]) – List of clusters to refine.

  • metadata (pd.DataFrame) – Metadata section of the MSDial file to use for refining clusters.

  • mz_tol_ppm (int) – m/z tolerance in ppm to use to split clusters.

Returns:

Refined list of clusters.

Return type:

list[pd.Index]
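
One way a ppm tolerance can be used to split a cluster is sketched below. The helper `split_by_ppm` is hypothetical and is not taken from the library. It assumes a non-empty cluster, that masses are compared between neighbours after sorting, and that the ppm difference is taken relative to the smaller mass; the actual splitting rule in `refine` may differ.

```python
import pandas as pd


def split_by_ppm(cluster, masses, mz_tol_ppm):
    """Hypothetical helper: split a cluster where neighbouring masses differ by more than mz_tol_ppm."""
    # Sort the cluster members by mass so that only neighbours need comparing.
    ordered = masses.loc[cluster].sort_values()
    groups = []
    current = [ordered.index[0]]
    for prev_id, next_id in zip(ordered.index[:-1], ordered.index[1:]):
        prev_mz = ordered.loc[prev_id]
        next_mz = ordered.loc[next_id]
        # Relative difference expressed in parts per million.
        ppm = abs(next_mz - prev_mz) / prev_mz * 1e6
        if ppm > mz_tol_ppm:
            groups.append(pd.Index(current))
            current = []
        current.append(next_id)
    groups.append(pd.Index(current))
    return groups


# Alignments 10 and 11 are about 5 ppm apart; alignment 12 is roughly 495 ppm away.
masses = pd.Series({10: 200.0000, 11: 200.0010, 12: 200.1000})
refined = split_by_ppm(pd.Index([10, 11, 12]), masses, mz_tol_ppm=10)
print([list(group) for group in refined])
```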

rcx_tk.msdial.aggregations(mean_columns: list[str], concat_columns: list[str], abundance_columns: list[str]) → dict[str, collections.abc.Callable]

Generate aggregation functions based on column types.

Parameters:
  • mean_columns (list[str]) – List of columns to aggregate using mean.

  • concat_columns (list[str]) – List of columns to aggregate using concatenation.

  • abundance_columns (list[str]) – List of columns to aggregate using max.

Returns:

Dictionary of aggregation functions, keyed by column name, for use with pandas aggregation (for example DataFrame.aggregate).

Return type:

dict[str, collections.abc.Callable]
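
A dictionary of this shape can be passed directly to a pandas group-by aggregation. The sketch below is an assumption-laden stand-in, not the library's code: `build_aggregations` is a hypothetical function, the `";"` separator used for concatenation is a guess, and the column names and the `cluster` column are invented for the example.

```python
import pandas as pd


def build_aggregations(mean_columns, concat_columns, abundance_columns):
    """Hypothetical stand-in: map each column name to an aggregation callable."""

    def _mean(values):
        return values.mean()

    def _concat(values):
        # Assumption: concatenated values are joined with ";".
        return ";".join(str(value) for value in values)

    def _max(values):
        return values.max()

    funcs = {}
    funcs.update({column: _mean for column in mean_columns})
    funcs.update({column: _concat for column in concat_columns})
    funcs.update({column: _max for column in abundance_columns})
    return funcs


df = pd.DataFrame(
    {
        "cluster": [0, 0, 1],
        "Average Rt(min)": [1.0, 3.0, 5.0],
        "Metabolite name": ["a", "b", "c"],
        "sample_A": [10.0, 30.0, 7.0],
    }
)

funcs = build_aggregations(
    mean_columns=["Average Rt(min)"],
    concat_columns=["Metabolite name"],
    abundance_columns=["sample_A"],
)
# One output row per cluster, each column reduced by its own function.
merged = df.groupby("cluster").agg(funcs)
print(merged)
```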

rcx_tk.msdial.find_clusters(all_duplicates: list[pandas.Index]) → list[pandas.Index]

Transitively merge all duplicate indices into groups; two groups are merged if they share any index.

Parameters:

all_duplicates (list[pd.Index]) – List of all duplicate indices.

Returns:

Clusters of connected duplicates.

Return type:

list[pd.Index]
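
Transitive merging means that if group A overlaps group B and group B overlaps group C, all three end up in one cluster even when A and C share nothing directly. A minimal sketch of that idea, using a hypothetical `merge_overlapping` function rather than the library's own code:

```python
import pandas as pd


def merge_overlapping(groups):
    """Hypothetical stand-in: merge index groups that share at least one label."""
    clusters = []  # Invariant: the sets in this list are pairwise disjoint.
    for group in groups:
        merged = set(group)
        remaining = []
        for cluster in clusters:
            if merged & cluster:
                # The new group touches this cluster, so absorb it.
                merged |= cluster
            else:
                remaining.append(cluster)
        remaining.append(merged)
        clusters = remaining
    return [pd.Index(sorted(cluster)) for cluster in clusters]


groups = [pd.Index([1, 2]), pd.Index([3, 4]), pd.Index([2, 3]), pd.Index([7, 8])]
clusters = merge_overlapping(groups)
# [1, 2] and [3, 4] are linked only through [2, 3]; [7, 8] stays separate.
print([list(cluster) for cluster in clusters])
```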

rcx_tk.msdial.union(all_duplicates: list[pandas.Index]) → pandas.Index

Combine a list of indices into a single union index.

Parameters:

all_duplicates (list[pd.Index]) – All indices to combine.

Returns:

Union of all indices.

Return type:

pd.Index
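
The union of several indices can be built by folding `pandas.Index.union` over the list. The function `union_of` below is a hypothetical illustration of that approach and may not match how the library does it; returning an empty index for an empty list is an assumption.

```python
from functools import reduce

import pandas as pd


def union_of(indices):
    """Hypothetical stand-in: fold Index.union over a list of indices."""
    if not indices:
        # Assumption: an empty input yields an empty index.
        return pd.Index([])
    return reduce(lambda left, right: left.union(right), indices)


combined = union_of([pd.Index([3, 1]), pd.Index([2, 3]), pd.Index([5])])
# Each label appears once in the result, even if several inputs contain it.
print(list(combined))
```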

rcx_tk.msdial.find_all_duplicates(data_matrix: pandas.DataFrame) → list[pandas.Index]

Get index of any duplicate values in any column.

Parameters:

data_matrix (pd.DataFrame) – DataFrame to check column-by-column for duplicate values.

Returns:

All indexes of duplicates.

Return type:

list[pd.Index]
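
A column-by-column duplicate search could look like the sketch below. It is only an illustration: `duplicates_per_column` is a hypothetical function, and it assumes that one index is returned for every value that occurs more than once within a column. The library may group or order its results differently.

```python
import pandas as pd


def duplicates_per_column(data_matrix):
    """Hypothetical stand-in: collect row labels that share a value within a column."""
    found = []
    for column in data_matrix.columns:
        # Group the row labels by the value they hold in this column.
        for rows in data_matrix.groupby(column).groups.values():
            if len(rows) > 1:
                found.append(rows)
    return found


data_matrix = pd.DataFrame(
    {
        "sample_A": [10.0, 10.0, 5.0, 7.0],
        "sample_B": [1.0, 2.0, 3.0, 3.0],
    },
    index=[100, 101, 102, 103],
)

all_duplicates = duplicates_per_column(data_matrix)
# Rows 100 and 101 share a value in sample_A; rows 102 and 103 share one in sample_B.
print([list(rows) for rows in all_duplicates])
```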