Data and databases:
===================

Adding new data to the main knowledge repository:
-------------------------------------------------

The easiest way to add new information to the main knowledge repository is to find the nodes to which the new knowledge will attach (the ``convert_to_internal_ids`` function from the ``bioflow.neo4j_db.db_io_routines`` module resolves a wide range of x-ref identifiers for physical entity nodes), and then to proceed to add new relationships and nodes using the ``DatabaseGraph.link`` function to add a connection between nodes and ``DatabaseGraph.create`` to add a new node. ``DatabaseGraph.attach_annotation_tag`` can be used to attach annotation tags to new nodes, making them searchable from the outside. All functions can be batched (cf. the API documentation).

A new link is created with a call of the form ``link(node_id, node_id, link_type, {param: val})``, where the node ids are the internal database ids provided by the ``convert_to_internal_ids`` function and ``link_type`` is a link type that will be easy for you to remember (preferably in snake_case). Two parameters are expected: ``source`` and ``parse_type``. ``parse_type`` can only take a value in ``['physical_entity_molecular_interaction', 'identity', 'refines', 'annotates', 'annotation_relationship', 'xref']``, with ``'xref'`` being reserved for annotation linking.

A new node is created with a call of the form ``create(node_type, {param: val})``, which returns the internal id of the created node. ``node_type`` is a node type that will be easy for you to remember (preferably in snake_case). Four parameters are expected: ``'parse_type'``, ``'source'``, ``'legacyID'`` and ``'displayName'``. ``'parse_type'`` can only take a value in ``['physical_entity', 'annotation', 'xref']``, with ``'xref'`` being reserved for annotation linking. ``legacyID`` is the identifier of the node in the source database and ``displayName`` is the name of the biological knowledge node that will be shown to the end user.
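Putting these calls together, a minimal sketch of adding a new node and linking it to existing entities could look as follows (the ``DatabaseGraph`` import path, the exact ``convert_to_internal_ids`` call signature and all identifiers here are illustrative assumptions, not part of the documented API)::

    from bioflow.neo4j_db.db_io_routines import convert_to_internal_ids
    # hypothetical import path for the DatabaseGraph instance:
    from bioflow.neo4j_db.GraphDeclarator import DatabaseGraph

    # map the external x-ref ids of the attachment points to internal db ids
    internal_ids = convert_to_internal_ids(['P38972', 'P53045'])

    # create the new physical entity node; its internal db id is returned
    new_node_id = DatabaseGraph.create(
        'protein_complex',                        # node_type, snake_case
        {'parse_type': 'physical_entity',
         'source': 'my_curation_2024',
         'legacyID': 'CPX-0001',
         'displayName': 'my protein complex'})

    # link the new node to each of the existing attachment nodes
    for attachment_id in internal_ids:
        DatabaseGraph.link(
            new_node_id, attachment_id,
            'complex_membership',                 # link_type, snake_case
            {'source': 'my_curation_2024',
             'parse_type': 'physical_entity_molecular_interaction'})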
Input files:
============

Hits set:
---------

By default, BioFlow expects a set of ids of genes or other physical entities that have been associated with a process or a phenotype of interest. In the current implementation, the identifiers are mapped to SWISSPROT-UNIPROT entities. The input file is expected to be a .tsv/.csv file with a single identifier per line, without a header.

When evaluating the hits set, in order to determine whether the hypothesis set (information flow pattern) it generates has any significance, BioFlow will randomly sample hypothesis sets generated by matched random sets of valid physical entities (SWISSPROT-UNIPROT), and compare the flow generated by the hits set to the one generated by the matched random sets.

Weighted hits set:
------------------

It is possible to specify a weight for each gene/protein that has been associated with the process or phenotype of interest, indicating how much confidence we have in the fact that a given gene/protein is indeed involved in the phenotype. For consistency with the theoretical framework underlying BioFlow, the values should be log-likelihoods; -log(p_value) could be a reasonable weight. The input file is expected to be a .tsv/.csv file with a single identifier and a positive weight per line, without a header.

When evaluating the significance, the matched random sampling will also match the weights of randomly sampled nodes to the weights of real hits. By default, the matching is exact: the weights from the sample will be permuted and randomly assigned to the sampled entities. Due to the way sampling and statistical significance evaluation work, it is suggested that only significant hits are supplied for the analysis.

Hits & targets set:
-------------------

In case we are looking to generate hypotheses that would connect two sets of genes (such as, for instance, genes experimentally identified as relevant to a phenotype and genes that have in the past been documented as causative of that phenotype), we can supply a secondary hits set to BioFlow. In the current implementation, hits and secondary hits are assumed to be disjoint, and the flow will be calculated between them and only between them. Just as for the primary hits set, both sets can be weighted. The file formatting is identical to that of the primary set.

Background:
-----------

By default, to calculate the flow significance, BioFlow samples all the valid identifiers in the database - which in the current implementation means SWISSPROT-UNIPROT confirmed protein IDs. However, for some experimental methods, some proteins are effectively impossible to detect due to the experimental modalities. When this is the case, the user can supply a file of identifiers that defines an explicit background set to sample from. The input file formatting is identical to the primary set.

Weighted background:
--------------------

In case the background genes are likely to be detected at random with different probabilities, it is also possible to supply a weighted background, which will sample the physical entities (in this case SWISSPROT-UNIPROT) based on that weight. The input file formatting is identical to the weighted primary set.

File translation:
-----------------

The external ids need to be converted to the internal db ids before providing the input to the Interactome/BioKnowledge interface, to make sure the sampling sizes are correct in case some external ids cannot be mapped to internal ids. This conversion is performed by ``bioflow.top_level.map_and_save_gene_ids``, which is quite polymorphic. Its general signatures are::

    > hits_ids, sec_hit_ids, background_internal_ids = map_and_save_gene_ids(
        'yeast_test_gene_set-glycogen_biosynthesis.tsv',
        '')  # single input

    > hits_ids, sec_hit_ids, background_internal_ids = map_and_save_gene_ids(
        'yeast_test_gene_set-glycogen_biosynthesis.tsv',
        'test_weighted_background.tsv')  # single input with background

    > hits_ids, sec_hit_ids, background_internal_ids = map_and_save_gene_ids(
        ('yeast_test_gene_set-glycogen_biosynthesis_tsw_1.tsv',
         'yeast_test_gene_set-glycogen_biosynthesis_tsw_2.tsv'),
        '')  # double input

    > hits_ids, sec_hit_ids, background_internal_ids = map_and_save_gene_ids(
        ('yeast_test_gene_set-glycogen_biosynthesis_tsw_1.tsv',
         'yeast_test_gene_set-glycogen_biosynthesis_tsw_2.tsv'),
        'test_weighted_background.tsv')  # double input with background

``hits_ids``, ``sec_hit_ids`` and ``background_internal_ids`` can then be used in the ``auto_analyze`` functions of the BioKnowledge or Interactome modules.

Node/edge weight adjustment:
----------------------------

For the ``InteractomeInterface``, it is possible to adjust the weights of nodes and edges (for instance, to rapidly compensate for hotspots due to identified biases in the flow patterns). This is not an encouraged way of addressing biases: ideally, the user should add the necessary information to the main knowledge graph repository and adjust the weighting policy based on that knowledge.

To do it nevertheless, a laplacian reweighting dictionary can be applied to an ``InteractomeInterface`` instance through the ``InteractomeInterface.apply_reweight_dict(lapl_reweight_dict)`` method. The reweight dictionary, supplied as ``lapl_reweight_dict``, maps node ids to correction coefficients and node id pairs to new edge weights. New edge weights simply replace the old ones, whereas a node correction coefficient is used as a multiplier for all edges going into and coming out of the node to which it is applied. Note that internal db ids need to be supplied. A reweight dictionary can also be supplied to ``bioflow.molecular_network.interactome_analysis.auto_analyze`` through the ``forced_lapl_reweight`` parameter. There is no similar function for BioKnowledge/GO, in order to prevent the disruption of a graph that is held together only by single connections in several locations.
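For illustration, a minimal sketch of such a reweighting, assuming an already-built ``interactome_interface`` instance and made-up internal db ids::

    # single ids map to node correction coefficients,
    # id pairs map to replacement edge weights (internal db ids throughout)
    lapl_reweight_dict = {
        101: 0.5,          # halve the weight of every edge touching node 101
        (202, 303): 2.0,   # replace the weight of the 202-303 edge with 2.0
    }

    # apply directly to an existing instance...
    interactome_interface.apply_reweight_dict(lapl_reweight_dict)

    # ...or hand it to the top-level wrapper (other arguments elided):
    # auto_analyze(hits_ids, ..., forced_lapl_reweight=lapl_reweight_dict)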
Modification of Policies:
=========================

A lot of internal behaviors are controlled by "policies", grouped inside the ``bioflow.algorithms_bank`` package. Each module in the package controls a specific aspect of knowledge graph construction and analysis. While a bit more involved than modifying the configs.yaml file used for basic configuration, most policies can be replaced by functions with matching signatures, provided as arguments to ``auto_analyze`` or to ``InteractomeInterface`` or ``GOInterface`` instances.

Main knowledge graph parsing:
-----------------------------

Given the difference in topology and the potential differences in the underlying assumptions, we pull the interactome knowledge network (where all nodes map to molecular entities and edges to physical/chemical interactions between them) and the annotome knowledge network (where some nodes might be concepts used to understand biological systems - such as ontology terms or pathways) separately.

The parse for the interactome is performed by retrieving all the nodes whose ``parse_type`` is ``physical_entity`` and all the edges whose ``parse_type`` is ``physical_entity_molecular_interaction``, ``identity`` or ``refines``. The giant component of the interactome is then extracted and two graph matrices - adjacency and laplacian - are built for it. Weights between the nodes are set in an additive manner according to the policy supplied as the argument to the ``InteractomeInterface.full_rebuild`` function or, in case a more granular approach is needed, to the ``InteractomeInterface.create_val_matrix`` function. By default, the ``active_default__weighting_policy`` functions from the ``bioflow.algorithms_bank.weighting_policies`` module are used. The resulting matrices are stored in the ``InteractomeInterface.adjacency_matrix`` and ``InteractomeInterface.laplacian_matrix`` instance variables, whereas the maps between the matrix indexes and the internal db ids are stored in the ``.neo4j_id_2_matrix_index`` and ``.matrix_index_2_neo4j_id`` variables.

The parse for the annotome is performed in the same way, but matching the ``parse_type`` of nodes to ``physical_entity`` and ``annotation``. In a properly built graph, this will result in only the edges of types ``annotates`` and ``annotation_relationship`` being pulled. Weighting functions are used in a similar manner, as are the mapping storage variables.
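Putting the above together, a minimal sketch of a full interactome rebuild with the default weighting policy (the no-argument constructor and the exact ``full_rebuild`` signature are assumptions based on the description above)::

    from bioflow.molecular_network.InteractomeInterface import InteractomeInterface
    # the policy name below is copied from the description above; check the
    # weighting_policies module for the exact spelling in your version
    from bioflow.algorithms_bank.weighting_policies import active_default__weighting_policy

    interactome = InteractomeInterface()
    interactome.full_rebuild(active_default__weighting_policy)

    # giant-component matrices:
    adjacency = interactome.adjacency_matrix
    laplacian = interactome.laplacian_matrix

    # mappings between matrix indexes and internal db ids:
    id_to_index = interactome.neo4j_id_2_matrix_index
    index_to_id = interactome.matrix_index_2_neo4j_id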
Custom weighting function:
--------------------------

In order to account for the different considerations that can come into play when deciding which nodes and connections are more likely to be included in hypothesis generation, we provide the end user with the possibility of using their own weighting functions for the interactome and the annotome. The provided functions are stored in the ``bioflow.algorithms_bank.weighting_policies`` module. The expected signature of such a function is ``starting_node, ending_node, edge -> float``, where ``starting_node`` and ``ending_node`` are of the ``.Node`` type, whereas ``edge`` is of the ``.Edge`` type. Any properties stored in the main knowledge repository (neo4j database) will be available as dict-like properties of the node/edge objects (``node['property']``/``edge['property']``). The functions are to be provided to the ``bioflow.molecular_network.InteractomeInterface.InteractomeInterface.create_val_matrix()`` method as the ``_weight_policy_function`` arguments, for the adjacency and laplacian matrices respectively.
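As an illustration, a toy weighting policy conforming to the contract above (the ``'source'`` property lookup is an assumption for the sake of the example - any property actually present on your edges can be used)::

    def confidence_weighting(starting_node, ending_node, edge):
        """Toy policy: trust manually curated links twice as much."""
        # edge and node properties stored in neo4j are available dict-style
        if edge['source'] == 'manual_curation':   # hypothetical source tag
            return 2.0
        return 1.0

Such a function can then be passed to ``create_val_matrix()`` in place of the default policies from ``bioflow.algorithms_bank.weighting_policies``.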
Custom flow calculation function:
---------------------------------

In case a specific algorithm for generating the pairs of nodes between which to calculate the information flow is needed, it can be assigned to ``InteractomeInterface._flow_calculation_method``. Its call signature should conform to the ``list, list, int -> list`` signature, where the returned list is a list of pairs of ``(node_idx, weight)`` tuples. By default, the ``general_flow`` method from ``bioflow.algorithms_bank.flow_calculation_methods`` will be used. It will try to match the expected flow calculation method based on the parameters provided (connex within a set if the secondary set is empty/None, star-like if the secondary set has only one element, bipartite if the secondary set has more than one element).

Similarly, methods to evaluate the number of operations and to reduce their number to a maximum ceiling with the optional int argument ``sparse_rounds`` need to be assigned to ``InteractomeInterface._ops_evaluation_method`` and ``InteractomeInterface._ops_reduction_method``. By default, these are ``evaluate_ops`` and ``reduce_ops`` from ``bioflow.algorithms_bank.flow_calculation_methods``.

Custom random set sampling strategy:
------------------------------------

In case a custom algorithm for the generation of the background sample needs to be implemented, it should be supplied to the ``InteractomeInterface.randomly_sample`` method as the ``sampling_policy`` argument. It is expected to accept an example of the sample and the secondary sample to match, the background from which to sample, the number of samples desired, and finally a single string parameter modifying the way it works (supplied by the ``sampling_policy_options`` parameter of the ``InteractomeInterface.randomly_sample`` method). By default, this function is implemented by the ``matched_sampling`` function in the ``bioflow.algorithms_bank.sampling_policies`` module.

Custom significance evaluation:
-------------------------------

By default, the ``auto_analyze`` functions for the interactome and the annotome use the default ``compare_to_blank`` functions and seek to determine the significance of flow based on a comparison of the flow achieved by nodes of a given degree in the real sample to that achieved in the random "mock" samples. The comparison is performed using a Gumbel_r distribution fitted to the highest flows achieved by the "mock" runs. As of now, to change the mode of statistical significance evaluation, a user will need to re-implement the ``compare_to_blank`` functions and monkey-patch them into the modules containing the ``auto_analyze`` function.

GDF files:
==========

BioKnowledgeInterface:
----------------------

In the case of the uniprot nodes (used to start the analysis of the annotation, given the amount of cross-referencing they have), the ``confusion potential``, ``p-value`` and ``p_p-value`` variables have no meaning. For ease of presentation, they have been hard-coded to the comparatively visible but non-obtrusive values of 1, 0.05 and 1.3 respectively. You might want to be aware of this when performing your p-value cutoffs, or alternatively just reweight them in the program you are using to analyze the network.
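For instance, a sketch of a node-level p-value cutoff on a GDF export that keeps these placeholders in mind (the file name and the presence of a ``p-value`` node column are assumptions - inspect your own export to adapt them)::

    import pandas as pd

    with open('interactome_analysis_output.gdf') as gdf_file:
        lines = gdf_file.read().splitlines()

    # GDF files list nodes between the 'nodedef>' and 'edgedef>' declarations
    node_start = next(i for i, line in enumerate(lines)
                      if line.startswith('nodedef>'))
    node_end = next(i for i, line in enumerate(lines)
                    if line.startswith('edgedef>'))

    # column declarations look like 'name VARCHAR'; keep only the names
    columns = [col.split()[0]
               for col in lines[node_start][len('nodedef>'):].split(',')]
    nodes = pd.DataFrame([line.split(',')
                          for line in lines[node_start + 1:node_end]],
                         columns=columns)

    # a strict < 0.05 cutoff already excludes the 0.05 placeholders
    # hard-coded for uniprot nodes; a <= cutoff would keep them
    significant = nodes[nodes['p-value'].astype(float) < 0.05]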