The data were first filtered based on the label-free quantification intensities (LFQi) using the following five steps: (i) removal of proteins labeled as “only identified by site”, “potential contaminant”, or “reverse”; (ii) removal of all observations with LFQi equal to 0; (iii) removal of outlier samples (based on low overall LFQi; see Fig S3); (iv) removal of proteins that are not present in at least 60% of the samples of a group, for each group (a group is defined as the three biological replicates, each with two technical replicates, for one condition, which gives a maximum group size of 6); and (v) filtering against the negative control sample, which consists only of the beads used for the AP-MS sample preparations, by retaining for further analysis only proteins that are significantly more abundant in the samples than in the negative control.

In MS-based proteomics data, there are typically two types of missing values: missing not at random (MNAR) and missing at random (MAR) (Lazar et al, 2016). A mixed imputation strategy was chosen, with k-nearest neighbor (kNN) imputation for MAR values (Gatto & Lilley, 2012; Gatto et al, 2021; Rainer et al, 2022). All other missing values were considered MNAR and imputed with the value 0. After the imputation, differential interaction analysis was performed for each group against the bead control. P-values were adjusted using the FDR correction described by Benjamini and Hochberg (1995). Afterward, for each group, all proteins that were significantly enriched in the sample were extracted (cutoffs: adjusted P-value < 0.01; log fold change > 1).

After the data filtering, the data were transformed to have consistent protein and gene name annotations. The MaxQuant software reports proteins as UniProt IDs, which were mapped to HGNC gene names using the HGNC database (retrieved 12/2021). However, one UniProt ID can correspond to multiple HGNC gene names; in these cases, the gene names of interest were selected manually. Finally, the HGNC names were mapped to gene IDs of the SysGO database (Luthert & Kiel, 2020). A few proteins could not be found in the SysGO database, and one protein was renamed (HGNC name PHB1, renamed PHD for SysGO). Then, the technical replicates were merged using the median. In summary, we obtained a dataset of biological triplicates with either raw LFQi (Table S2) or log2-transformed (Table S3) values.

Data preparation was performed in R (http://www.r-project.org/index.html) using the following packages: dplyr (Beckerman et al, 2017), tidyr (Wickham et al, 2019), stringr (Wickham, 2010), tidyxl, purrr (Mailund, 2019), DEP (Zhang et al, 2018), and limma (Ritchie et al, 2015; Phipson et al, 2016). The script file for the data preparation and the data before and after preparation are available on Zenodo (Camille et al, 2022).
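
Three of the steps described above (the 60% presence filter within a group, the Benjamini-Hochberg adjustment with the stated cutoffs, and the median merge of technical replicates) can be illustrated with a minimal sketch. This is not the authors' R script, which is available on Zenodo; it is a Python illustration in which all function names and example values are hypothetical, and it assumes that an intensity of 0 or a missing entry means the protein was not present in that sample.

```python
from statistics import median


def passes_group_filter(intensities, min_fraction=0.6):
    """Return True if a protein is present (LFQi > 0) in at least
    `min_fraction` of the samples of one group.

    Assumption: 0 or None marks a sample in which the protein is absent.
    """
    if not intensities:
        return False
    present = sum(1 for value in intensities if value is not None and value > 0)
    return present / len(intensities) >= min_fraction


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def is_enriched(adjusted_p, log_fold_change, p_cutoff=0.01, lfc_cutoff=1.0):
    """Apply the cutoffs from the text: adjusted P < 0.01 and log FC > 1."""
    return adjusted_p < p_cutoff and log_fold_change > lfc_cutoff


def merge_technical_replicates(values):
    """Merge the technical replicates of one biological replicate by median."""
    return median(values)


if __name__ == "__main__":
    # Hypothetical group of 6 samples: the protein is present in 4 of them.
    print(passes_group_filter([5.0, 0, 7.0, 8.0, None, 6.0]))
    # Hypothetical raw P-values for four proteins.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
    # Two hypothetical technical replicates of one biological replicate.
    print(merge_technical_replicates([10.0, 14.0]))
```

The sketch checks the presence filter for a single group only. Whether a protein has to pass in every group or in at least one is left to the caller, because the text does not spell this out.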

Table S2. Raw AP-MS LFQ intensity data with biological triplicates.

Table S3. Log2-transformed AP-MS data with biological triplicates.
