
BioFlow Project

Information Flow Analysis in biological networks


Overview:

BioFlow is a quantitative systems biology tool.

It leverages information complexity theory in order to generate mechanism-of-action hypotheses in a robust, quantified, highly granular, explainable, customizable and near-optimal manner.

The core of BioFlow is a biological knowledge entity-relationship graph, derived from existing biological knowledge repositories such as Reactome, Gene Ontology, HINT, Phosphosite, ComplexPortal or BioGrid. Upon importing them, BioFlow converts their contents into a single unified graph of entities and weighted relationships.

The biological knowledge entities from different repositories are cross-referenced, and connections between them are established based on how likely they are to be directly connected in a hypothesis for a general mechanism of action.

The input for BioFlow is a list of physical entities most likely involved in a process of interest - usually a list of genes or proteins.

BioFlow then examines all the possible paths linking those physical entities of interest, assigns an overall probability to each path, and analyses the most likely common denominators of those paths, assigning to each biological knowledge entity, and to each relationship between them, a chance of being part of the mechanism of action behind the process.

In order to avoid picking up spurious associations, the resulting weights for the biological knowledge entities and the relationships between them are compared to those generated by random groups of genes, and only then are the final probabilities of inclusion into mechanism-of-action hypotheses calculated.

The final probabilities for the knowledge entities are sorted by probability of inclusion, printed to the terminal and saved as a .tsv file, whereas the entire network, with probabilities assigned to entities and the relationships between them, is saved for further analysis as a .gdf file.
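
To give an intuition for the flow computation, here is a minimal toy sketch of the underlying electrical-network idea (not BioFlow's actual code; the graph, weights and hit choice are made up): the knowledge graph is treated as a resistor network, unit current is injected between a pair of hits, and the current passing through every other node measures how likely that node is to lie on a path connecting them:

> # Toy illustration of the information-flow idea (NOT BioFlow's actual code):
> # treat the knowledge graph as a resistor network and measure the current
> # passing through every node when unit current flows between two hits.
> import numpy as np
>
> # hypothetical 5-entity graph with weighted undirected edges (confidence weights)
> edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 1.0), (1, 3, 0.2), (3, 4, 1.0)]
> n = 5
> A = np.zeros((n, n))
> for i, j, w in edges:
>     A[i, j] = A[j, i] = w
> L = np.diag(A.sum(axis=1)) - A          # weighted graph Laplacian
>
> source, sink = 0, 4                     # two hits of interest
> b = np.zeros(n)
> b[source], b[sink] = 1.0, -1.0          # inject / withdraw unit current
> potentials = np.linalg.pinv(L) @ b      # node potentials ("voltages")
>
> node_current = np.zeros(n)
> for i, j, w in edges:
>     c = w * abs(potentials[i] - potentials[j])   # current carried by the edge
>     node_current[i] += c / 2
>     node_current[j] += c / 2
> print(node_current)   # nodes on likely paths between the hits carry more current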

BioFlow is:

  • robust, because we weight the relationships between the knowledge entities based on how likely they are to be real and to be included in a mechanism of action. Thanks to that, we can include relationships we are not sure of (e.g. Y2H protein-protein interactions with a low confidence score) and, by assigning them a low weight, make sure they will not be included into mechanism-of-action hypotheses unless no other possible connection can explain a phenotype.
  • quantified, because every single biological knowledge entity is assigned both a weight and a p-value for how likely it is to contribute to a mechanism of action.
  • highly granular, because thanks to the weighting of edges and the inclusion of multiple sources of biological knowledge, we can evaluate in the same pass steps for which we have a very good mechanistic understanding (e.g. phosphorylation of a specific site on a specific isoform by another specific isoform) as well as steps whose in-vivo existence we are not entirely convinced of (once again, low-score Y2H). Similarly, thanks to a unified model of biological entities and the relationships between them, knowledge of very different granularity can be evaluated within the same analysis.
  • explainable, because all the steps in the hypothesis are biological entities and connections between them. As such, a direct translation from a BioFlow analysis to an experiment is “let’s suppress this entity/relationship with a high p-value and a high weight” and see if it affects our process.
  • customizable, because BioFlow is built as a library with multiple abstraction levels and customization capabilities.
    • The knowledge graph can be directly accessed and modified through the graphical user interface provided by the neo4j graph database storage back-end, as well as extracted as scipy.sparse weighted graph laplacian and adjacency matrices with index-to-entity-id maps.
    • By adding new rdf turtle tuple parsers into the bioflow.bio_db_parser and inserters into the bioflow.db_importers, new sources of biological knowledge can be integrated.
    • By modifying routines in the bioflow.algorithms_bank, new entity relationship weighting modes, background sampling algorithms or evaluation methods of hypotheses statistical significance can be introduced.
    • In case of absolute need, alternative storage backends can be implemented by re-implementing the GraphDBPipe object in bioflow.neo4j_db.cypher_drivers or methods from bioflow.sample_storage.mongodb.
  • near-optimal, because we are using a finite-world version of Solomonoff Algorithmic Inference - a provably optimal way of learning model representation from data. In order to accelerate the computation and stabilize the output with respect to the noise often encountered in biological data, we slightly modify the inference mode and background computation algorithm, hence the “near”-optimality.

Examples of applications:

A good way of thinking about BioFlow is as a specialized search engine for biological knowledge.

Molecular mechanism hypotheses from high-throughput experiment results:

A lot of modern biology explores phenotypes of interest from an unbiased perspective. In order to get an initial idea of a mechanism, a standard approach is to take several experimental models that present a deviation of the phenotype of interest and perform a high-throughput experiment, such as mutagenesis, a mutant library screen, transcription profiling or proteomics profiling. The result is often a list of hundreds of genes that are potentially involved in a mechanism, but it is rare that a mechanism is directly visible, or even that there is any certainty that the list is not mostly composed of experimental artefacts.

BioFlow is capable of generating an unbiased, quantified list of hypotheses as to the molecular mechanism underlying the process.

  • Thanks to its integrated knowledge graph, it can implicate mechanisms that would not be detectable by the screening method used to generate the input data.
  • Thanks to its parallel evaluation of possible molecular mechanisms, it can point to backup mechanisms, as well as to molecular entities or pathways weakly involved in the process of interest.
  • Thanks to its null model of a random list of genes, it can filter out spurious nodes that are due to artifacts.
  • Thanks to its null model, if the provided list of genes is a pure artefact, it will not call any nodes as likely parts of a mechanism and will mark all hypotheses of molecular mechanisms as insufficiently likely. A toy sketch of such a null-model comparison follows this list.
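
As an illustration of the null-model comparison (a minimal toy sketch, not BioFlow's sampling code; the observed flow and the background distribution below are made up), the flow through a node produced by the real hit list can be compared to the flows produced by randomly sampled gene lists, yielding an empirical p-value:

> # Toy empirical p-value against a random-gene background (NOT BioFlow's actual code)
> import numpy as np
>
> rng = np.random.default_rng(0)
> observed_flow = 2.7                            # flow through a node for the real hit list
> null_flows = rng.gamma(2.0, 1.0, size=1000)    # stand-in for flows from 1000 random gene lists
>
> # fraction of random lists producing at least as much flow through this node
> p_value = (np.sum(null_flows >= observed_flow) + 1) / (null_flows.size + 1)
> print(p_value)   # a small p-value means the node is unlikely to carry that much flow by chance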

Personalized cancer medicine:

Cancers are characterized by a large variety of mutations affecting numerous pathways in the organism. Some of the effects are antagonistic, some synergistic, but a lot of mutations and expression modifications are variations affecting similar pathways.

BioFlow is capable of integrating the list of perturbations found in a patient's cancer (such as mutations, transcriptional modifications, protein trafficking imbalances, …) and building a model of the perturbed molecular pathways in that patient, allowing drug and drug-combination selection to be prioritized.

Evaluation of the effects of large-scale genome perturbation:

In some cases, such as partial genome duplication or aneuploidy, or a recombination event, a large number of gene expression levels or protein structures are perturbed simultaneously. While models might exist for single genes or small groups of genes, the sheer number of perturbations and ways in which they can interfere makes the prediction intractable for humans.

In case any group of perturbations is likely to have a major synergistic effect, BioFlow will highlight it, as well as the likely molecular mechanism it could act through.

In-silico drug secondary effects prediction:

Given that a lot of small-molecule drugs possess polypharmacological, multi-target activity, a number of secondary effects that cannot be traced to single targets being poisoned by the compound or its metabolic derivatives are hypothesized to be related to the systemic effects of unspecific binding of the compound.

By combining the list of compounds that have been associated with a specific secondary effect with the binding profiles of those compounds (either measured in-vitro or simulated in-silico), BioFlow can create the network of nodes and relationships most likely implicated in the mechanism of action underlying the induction of that secondary effect by off-target binding. By comparing the network flow generated by a de-novo compound's binding profile to those engendered by drugs that do or do not present the secondary effect, we can infer the most likely secondary effects and prioritize downstream pre-clinical testing.

In-silico drug repurposing:

One of the advantages of already approved drugs is that they have been shown to be relatively safe in humans and have well-understood secondary effects. As such, their application to the treatment of novel diseases is significantly more desirable than the development of new compounds, both due to a shorter development time and because only efficacy trials in humans are needed. This application is particularly interesting for rare and neglected diseases, where de-novo compound development is usually not economically viable.

BioFlow can be used to construct the profile of biological entities that are most likely to be implicated in the molecular mechanism of action behind the disease (such as deviations from the norm in rare human genetic disorders, or essential pathways in pathogens). Based on in-silico or in-vivo binding assays of approved compounds against the targets relevant to the phenotype, BioFlow can help to prioritize the compounds for further investigation.

Relationship to other methods:

Network diffusive models:

BioFlow generalizes network diffusion models. While both BioFlow and network diffusion models rely on the graph Laplacian and the flow through it, BioFlow uses that formalization to rank the most probable molecular mechanisms in a maximally unbiased, nearly optimal manner, whereas most network diffusion models work by pure analogy.

BioFlow’s near-optimality provides an explanation for the uncanny efficiency of graph diffusion models, while in addition providing direct interpretability of the results and suggesting schemes for weighting the graph edges.

Network topology methods:

Compared to network topology methods, thanks to its weighting scheme and all-paths probability evaluation, BioFlow is much less brittle with respect to the inclusion of low-confidence edges affecting the network topology. It allows multiple abstraction levels to be examined simultaneously, which is particularly interesting in cases where granular information is available only for a small subset of nodes and edges. In turn, this capability to work with multiple levels of abstraction granularity allows BioFlow to work with heterogeneous data, integrating different types of perturbation at the same time.

Annotation group methods:

Given that BioFlow does not rely on strict borders between categories, uses weights, and simultaneously evaluates all the possible molecular mechanisms based on the data, it is significantly less brittle with regard to the inclusion of specific molecular entities in the annotation groups, or to the inclusion of specific molecular entities in the list associated with the process of interest.

Similarly, when BioFlow analyses possible hypotheses on human-generated annotation networks, it provides much more interpretability with regard to the annotation proximity of terms, taking into account annotation term overloading, over-annotation of single molecular entities, and the confidence with which a molecular entity is annotated with a given term.

Finally, by combining the analysis of multiple annotation networks with that of the molecular entity network, it is less prone to neglecting processes that have not yet been annotated in the annotation network.

Mechanistic models:

Given that BioFlow is capable of operating with multiple levels of granularity of biological knowledge abstraction and uses a unified model for all molecular entities and the relationships between them, it is able to work with many more data types, does not require exact knowledge to generate hypotheses, and is computationally simpler and more stable at scale. Similarly, it is able to suggest possible mechanisms where no mechanistic models exist yet.

However, BioFlow’s model means that it has a more restricted expressivity. For instance, it will not be able to recognize synergistic vs antagonistic interactions between the perturbations it is analyzing, or distinguish repressors from inducers.

Overall, BioFlow is a good precursor to mechanistic models, if the nodes and interactions that are ranked highly by BioFlow have a strong overlap with known mechanistic models.

Functioning and specifics of the implementation:

BioFlow requires a running instance of the Neo4j graph database for the main knowledge repository, as well as a running MongoDB instance.

Upon start, BioFlow will look for the $BIOFLOWHOME environment variable to know where to store its files. If it is not set, it will use the default ~/bioflow directory.

Inside $BIOFLOWHOME it will store the user configuration .yaml file ($BIOFLOWHOME/configs/main_configs.yaml). If for whatever reason it doesn't find it, it will copy the default configs there. If you want to reset the configs to defaults, just delete or rename your config yaml file.

The config contains several sections:

  • DB_locations: maps where to look for the databases used to build the main biological entity relationship graph and where to store them locally. If you get an error on download, chances are one of the source databases has moved. Alternatively, if you want to use a specific snapshot of a database, you can change the online location the file is loaded from.
  • Servers: stores the URLs and ports at which BioFlow will expect MongoDB and Neo4j to be available.
  • Sources: allows the organism to be selected. If you are not sure of what you are doing, just uncomment the organism you want to work on.
  • User_settings:
    • smtp_logging: enable and configure this if you want to receive notifications about errors or finished runs by mail. Given that you will need a local smtp server sending mails properly, setting up this section is not for the faint of heart.
    • environment: modifies how some aspects of BioFlow work. Comments explain what each setting does, but you will need to understand the inner workings of BioFlow to use them.
    • analysis: controls the parameters used to calculate statistical significance.
    • debug_flags: potentially useful if you want to debug an issue or fill out a bug report.

Everything is logged to $BIOFLOWHOME/.internal/logs. Debug messages, warnings and critical errors are all stored there.

Upon execution, a run output folder is created in $BIOFLOWHOME/outputs/, named with the launch datetime in ISO format. It will contain any output generated by the run, as well as the info-level log (basically, a copy of what is printed to the console).

Finally, due to large differences in topological structure and weighting algorithms, the analysis of biological knowledge nodes representing molecular entities (proteins, isoforms, small molecules) and of those representing human-made abstractions used to reason about them (Gene Ontology terms, pathways, …) is split into two different modules (molecular network/Interactome vs annotation network/BioKnowledge modules/classes).

The full API documentation is available at readthedocs.org.

Basic Usage:

Installation walk-through:

Ubuntu direct installation:

  1. Install Anaconda Python 3.x and use the Python provided by Anaconda in all that follows. A way of doing it is by making Anaconda Python your default Python. The full process is explained here.

  2. Install libsuitesparse:

    > apt-get -y install libsuitesparse-dev
    
  3. Install neo4j:

    > wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
    > echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list
    > sudo apt-get update
    > sudo apt-get install neo4j
    
  4. Install MongoDB (assuming Ubuntu 18.04 - if not, see here):

    > sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
    > echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
    > sudo apt-get update
    > sudo apt-get install -y mongodb-org
    

For more information, refer to the installation guide

  5. Finally, install BioFlow through pip:

    > pip install BioFlow
    

Or, if you want to install it directly:

> git clone https://github.com/chiffa/BioFlow.git
> cd <installation directory/BioFlow>
> pip install -r requirements.txt

Docker:

If you want to build locally (note that you need to issue docker commands as a docker-enabled user, usually by prepending sudo to the commands):

> git clone https://github.com/chiffa/BioFlow.git
> cd <BioFlow installation folder>
> docker build -t "bioflow" .
> docker run bioflow
> docker-compose build
> docker-compose up -d

If you want to pull from dockerhub or don’t have access to BioFlow installation directory:

> wget https://github.com/chiffa/BioFlow/blob/master/docker-compose.yml
> mkdir -p $BIOFLOWHOME/input
> mkdir -p $BIOFLOWHOME/source
> mkdir -p $BIOFLOWHOME/.internal/docker-mongo/db-data
> mkdir -p $BIOFLOWHOME/.internal/docker-neo4j/db-data
> docker-compose build
> docker-compose up -d

Finally attach to the running container:

> docker attach bioflow_bioflow_1

For working from docker, you will need to have the $BIOFLOWHOME environment variable defined (by default $HOME/bioflow).

Scripts with which docker build was tested can be found in the docker_script.sh file.

For persistent storage, the data will be stored in the mapped volumes as follows:

Volume mapping:

  What            Docker                       On disk
  neo4j data      /data (neo4j docker)         $BIOFLOWHOME/.internal/docker-neo4j/db-data
  mongodb data    /data/db (mongodb docker)    $BIOFLOWHOME/.internal/docker-mongo/db-data
  bioflow home    /bioflow                     $BIOFLOWHOME
  inputs          /input                       $BIOFLOWHOME/input

Usage walk-through:

Warning

While BioFlow provides an interface to download the databases programmatically, the databases themselves are subject to licenses and terms that it is up to the end users to respect.

For more information about data and config files, refer to the data and database guide

Python scripts:

This is the recommended method for using BioFlow.

An example usage script is provided by bioflow.analysis_pipeline_example.py.

First, let’s pull the online databases:

> from bioflow.utils.source_dbs_download import pull_online_dbs
> pull_online_dbs()

Now, we can import the main knowledge repository database handlers and build the main database:

> from bioflow.db_importers.import_main import destroy_db, build_db
> build_db()

The building process will take a bit - up to a couple of hours.

Now, you can start using BioFlow proper:

> from bioflow.annotation_network.knowledge_access_analysis import auto_analyze as \
>     knowledge_analysis, _filter
> from bioflow.molecular_network.interactome_analysis import auto_analyze as interactome_analysis
> # map_and_save_gene_ids and rebuild_the_laplacians are used further below; in recent
> # versions they are exposed in bioflow.utils.top_level (check your version if this import fails)
> from bioflow.utils.top_level import map_and_save_gene_ids, rebuild_the_laplacians

> hits_file = "/your/path/here.tsv"
> background_file = "/your/other_path/here.tsv"

And get to work: map the hits and the background genes to internal db IDs:

> hit_ids, background_ids = map_and_save_gene_ids(hits_file, background_file)

BioFlow expects the hits and background tsv/csv files to contain one identifier per line, and will attempt to map them to UNIPROT protein nodes (used as a backbone to cross-link the imported databases), based on the following identifier types:

  • Gene names
  • HGNC symbols
  • PDB Ids
  • ENSEMBL Ids
  • RefSeq IDs
  • Uniprot IDs
  • Uniprot accession numbers
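
For instance, a minimal hits .tsv file could look like this (a hypothetical example with human gene names, one identifier per line):

    CDC42
    RAC1
    RHOA
    PAK1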

(Re)build the laplacians (not required unless the knowledge structure in the main knowledge database has changed):

> rebuild_the_laplacians()

Launch the analysis itself for the information flow in the interactome:

> interactome_analysis(source_list=[hit_ids],
>                      output_destinations=['<name_of_experiment>'],    # optional
>                      desired_depth=20,                                # optional
>                      processors=3,                                    # optional
>                      background_list=background_ids,                  # optional
>                      skip_sampling=False)                             # optional

Launch the analysis itself for the information flow in the annotation network (experimental):

> knowledge_analysis(source_list=[hit_ids],
>                    output_destinations=['<name_of_experiment>'],    # optional
>                    desired_depth=20,                                # optional
>                    processors=3,                                    # optional
>                    background_list=background_ids,                  # optional
>                    skip_sampling=False)                             # optional

Where:

  hit_ids:              list of hits
  output_destinations:  names to provide for the output destinations (by default numbered from 0)
  desired_depth:        how many samples we would like to generate to compare against
  processors:           how many threads we would like to launch in parallel (in general 3/4 works best)
  background_list:      list of background ids
  skip_sampling:        if True, skips the sampling of the background set and retrieves stored samples instead

BioFlow will print progress to StdErr from then on and will write its output to $BIOFLOWHOME, in a folder called outputs_YYYY-MM_DD <launch time>:

  • .gdf file with the flow network and relevance statistics (Interactome_Analysis_output.gdf)
  • visualisation of information flow through nodes in the null vs hits sets based on the node degree
  • list of strongest hits (interactome_stats.tsv) (printed to StdOut as well)

The .gdf file can be further analysed with more appropriate tools, such as for instance Gephi.

Enabling the SMTP logging would require you to manually build a try-except around your script code:

> from bioflow.utils.smtp_log_behavior import get_smtp_logger, started_process, \
>     successfully_completed, smtp_error_bail_out

> try:
>     started_process()
> except Exception as e:
>     smtp_error_bail_out()
>     raise e

> # `logger` below is assumed to have been obtained from get_smtp_logger()
> try:
>     <your code here>
> except Exception as e:
>     try:
>         logger.exception(e)
>     except Exception as e:
>         smtp_error_bail_out()
>         raise e
>     raise e
> else:
>     try:
>         successfully_completed()
>     except Exception as e:
>         smtp_error_bail_out()
>         raise e

Command line:

The command line interface can either be invoked through python execution:

> python -m bioflow.cli <command> [--options]

Or, in case of installation with pip, directly from a command line (assumed here):

> bioflow <command> [--options]

Setup the environment (likely to take a while to pull all the online databases):

> bioflow downloaddbs
> bioflow loadneo4j

Set the list of perturbed proteins on which we want to base our analysis:

> bioflow mapsource /your/path/here.tsv --background=/your/other_path/here.tsv

Rebuild the laplacians:

> bioflow rebuildlaplacians

Perform the analysis:

> bioflow analyze --matrix interactome --depth 24 --processors 3 --background True
                --name=<name_of_experiment>

> bioflow analyze --matrix annotome --depth 24 --processors 3 --background True
                --name=<name_of_experiment>

Alternatively:

> bioflow analyze --depth 24 --processors 3 --background True --name=<name_of_experiment>

More information is available with:

> bioflow --help

> bioflow about

The results of analysis will be available in the output folder, and printed out to the StdOut.

Post-processing:

The .gdf file format is one of the standard formats for graph exchange. It contains the following columns for the nodes:

  • node ID
  • information current passing through the node
  • node type
  • legacy_id
  • degree of the node
  • whether it is present or not in the hits list (source)
  • p-value, comparing the information flow through the node to the flow expected for a random set of genes
  • p_p-value: -log10(p_value)
  • rel_value: information flow relative to the flow expected for a random set of genes
  • std_diff: how many standard deviations above the flow for a random set of genes the flow from the hits list is (not a robust metric)
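
If you prefer scripted post-processing to Gephi, the node section of the .gdf file can be loaded into a DataFrame with a few lines of Python (a hypothetical sketch: the read_gdf_nodes helper below is not part of BioFlow, and the column names should be checked against the header of your actual file):

> # Sketch: load the node section of a .gdf file into a pandas DataFrame.
> # A .gdf node section starts with a "nodedef>" header line and ends where "edgedef>" begins.
> import pandas as pd
>
> def read_gdf_nodes(path):
>     with open(path) as gdf:
>         lines = gdf.read().splitlines()
>     header = lines[0].replace("nodedef>", "")                  # e.g. "name VARCHAR,current DOUBLE,..."
>     names = [column.split()[0].strip() for column in header.split(",")]
>     edge_start = next(i for i, line in enumerate(lines) if line.startswith("edgedef>"))
>     rows = [line.split(",") for line in lines[1:edge_start]]
>     return pd.DataFrame(rows, columns=names)
>
> nodes = read_gdf_nodes("Interactome_Analysis_output.gdf")
> print(nodes.head())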

The most common pipeline involves using the Gephi open graph visualization platform:

  • Load the .gdf file into Gephi
  • Filter out all the nodes with information flow below 0.05 (Filters > Attributes > Range > current)
  • Perform clustering (Statistics > Modularity > Randomize & use weights)
  • Filter out all the nodes below a significance threshold (Filters > Attributes > Range > p-value)
  • Color the nodes based on the Modularity Class (Nodes > Colors > Partition > Modularity Class)
  • Set node size based on p_p-value (Nodes > Size > Ranking > p_p-value)
  • Set text color based on whether the node is in the hits list (Nodes > Text Color > Partition > source)
  • Set text size based on p_p-value (Nodes > Text Size > Ranking > p_p-value)
  • Show the labels (T on the bottom left)
  • Set labels to the legacy IDs (Notepad on the bottom)
  • Perform a ForceAtlas node separation (Layout > Force Atlas 2 > Dissuade Hubs & Prevent Overlap)
  • Adjust label size
  • Adjust label positions (Layout > LabelAdjust)

For more details or usage as a library, refer to the usage guide.
