Commit

Consolidate Dataflow documentation
drewnoakes committed Oct 25, 2023
1 parent 65efb02 commit c872f28
Showing 3 changed files with 14 additions and 86 deletions.
1 change: 0 additions & 1 deletion doc/Index.md
@@ -23,7 +23,6 @@ VS Project System Documentation
- [Responsive design](overview/responsive_design.md)
- [Globbing behavior](overview/globbing_behavior.md)
- [Dataflow](overview/dataflow.md)
- [Dataflow in CPS](overview/dataflow_in_CPS.md)
- [Dataflow source blocks](extensibility/dataflow_sources.md)
- Diagnostics
- [How to examine Visual Studio registry](overview/examine_registry.md)
16 changes: 13 additions & 3 deletions doc/overview/dataflow.md
@@ -229,6 +229,8 @@ CPS has a few subclasses of `DataflowLinkOptions` that you can use in certain ci

# Dataflow in CPS

One of the main goals of CPS is to move the bulk of the project system work to background threads while still maintaining data consistency. To accomplish this, CPS leverages Dataflow to produce a versioned, immutable, producer-consumer pattern to flow changes through the project system. Dataflow is not always easy, and if used wrongly can lead to corrupt state and deadlocks.

## Slim blocks

TPL's Dataflow blocks are general purpose and have features that aren't used in CPS. Those unused features come with a performance/memory cost. To improve the scalability of CPS in large solutions, we have a replacement set of "slim" blocks that provide the required behaviours of TPL's blocks, but without the overhead associated with the unused features.
@@ -245,7 +247,9 @@ TPL's Dataflow blocks are general purpose and have features that aren't used in

Dataflow graphs publish immutable snapshots of data between blocks, where updates are pushed through the graph in an asynchronous fashion. This gives the framework a lot of flexibility to schedule the work, but can make it difficult to know when a given input has made its way through the graph to the outputs.

Another challenge with Dataflow graphs is joining data. Consider the following graph:
Dataflow is simple when you have a single line of dependencies, but in CPS it is much more complex. It is common for a chained data source to require input from multiple upstream sources, and for those upstream sources to have multiple inputs of their own. This pattern introduces a data consistency problem.

Consider the following Dataflow graph:

```mermaid
flowchart LR
@@ -271,7 +275,7 @@ public interface IProjectValueVersions
}
```

And in fact, a versioned value can have _more than one version!_ This makes sense when you consider that a given node in the graph can have more than one source block feeding in to it. Each of those source blocks provides its own versioned value, and as messages are joined, the sets of versions are merged.
And in fact, a versioned value can have _more than one version!_ This makes sense when you consider that a given node in the graph can have more than one source block feeding in to it. Each of those source blocks provides its own versioned value, and as messages are joined the sets of versions are merged.
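
The merging described above can be sketched as follows. This is a minimal illustration rather than CPS's implementation: `VersionMergeSketch` and its `Merge` method are hypothetical, and plain `string` keys with `long` versions stand in for the real key and version types.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

// Hypothetical stand-in for a versioned value's version map. Assumption:
// string keys and long versions replace the real CPS key and version types.
public static class VersionMergeSketch
{
    // Combines the version maps carried by two joined messages.
    // A key present on both sides must agree, otherwise the inputs
    // were not consistent and should not have been joined.
    public static ImmutableDictionary<string, long> Merge(
        ImmutableDictionary<string, long> left,
        ImmutableDictionary<string, long> right)
    {
        ImmutableDictionary<string, long>.Builder builder = left.ToBuilder();

        foreach (KeyValuePair<string, long> pair in right)
        {
            if (builder.TryGetValue(pair.Key, out long existing) && existing != pair.Value)
            {
                throw new InvalidOperationException(
                    $"Inconsistent versions for '{pair.Key}'.");
            }

            builder[pair.Key] = pair.Value;
        }

        return builder.ToImmutable();
    }

    public static void Main()
    {
        var fromA = ImmutableDictionary<string, long>.Empty.Add("A", 3);
        var fromB = ImmutableDictionary<string, long>.Empty.Add("B", 7);

        ImmutableDictionary<string, long> merged = Merge(fromA, fromB);

        Console.WriteLine(merged.Count); // 2
    }
}
```

The joined value ends up carrying one entry per original source it was derived from, which is why a single versioned value can report more than one version.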

```mermaid
flowchart LR
@@ -347,6 +351,12 @@ IDisposable link = ProjectDataSources.SyncLinkTo(

The `SyncLinkOptions` extension method allows the data source to be configured. If the source contains rule-based data (discussed [below](#rule-sources))

### Allowing inconsistent versions

In special cases that require it, it is possible to allow for inconsistent versions in your Dataflow. This is for when you depend on multiple upstream sources where one is drastically slower at producing values than others, but you want to be able to produce intermediate values while the slow one is still processing. An example of this is where you want data quickly from project evaluation, and also want the richer data that arrives later via design-time builds.

Unfortunately, there is no built-in support for this scenario. You will have to manually link to your upstream sources and synchronize between them. When producing chained output, to calculate the data versions to publish you may be able to use `ProjectDataSources.MergeDataSourceVersions`.

## Subscribing to project data

One of the main use cases for Dataflow in CPS is the processing of project data. Unlike the legacy CSPROJ project system where updates were generally applied on a single thread (the main thread), CPS uses Dataflow to schedule updates asynchronously on the thread pool.
@@ -476,7 +486,7 @@ CPS provides access to several such `IProjectValueDataSource<T>` instances via `

### Chained (derived) data sources

Most `IProjectValueDataSource<T>` instances will produce data that was derived from other project value data sources. CPS provides the abstract base class `ChainedProjectValueDataSourceBase<T>`, which makes creating such a derived (chained) source easy.
Most `IProjectValueDataSource<T>` instances will produce data that was derived from one or more other project value data sources. CPS provides the abstract base class `ChainedProjectValueDataSourceBase<T>`, which makes creating such a derived (chained) source easy.

Let's look at an example of overriding this class to create a new data source that derives its data from one other source:

83 changes: 1 addition & 82 deletions doc/overview/dataflow_in_CPS.md
@@ -1,84 +1,3 @@
# Dataflow in CPS

One of the main goals of CPS is to move the bulk of the project system work to background threads,
while still maintaining data consistency. To accomplish this, CPS leverages the [TPL.Dataflow](https://learn.microsoft.com/dotnet/standard/parallel-programming/dataflow-task-parallel-library)
library to produce a versioned, immutable, producer-consumer pattern to flow changes through the
project system. Dataflow is not always easy, and if used wrong it can quickly lead to corrupt
project states or deadlocks.

## Types of Dataflow in CPS

Dataflow in CPS comes primarily in two types, an original source or a chained source.

1. Original Source
   * Depends on an original source of data that is not part of dataflow.
   * Always has its own version.
   * e.g. a file on disk
2. Chained Source
   * Chains into existing dataflow.
   * Can be one or multiple dataflow blocks that feed into this one.
   * Very __rarely__ has its own version. Typically if it does, it can
     be pulled out into an original source.
   * Carries all the versions of the dataflow it chains to.
     * More about versioning later

## Data Consistency Problem

Dataflow is simple when you have a single line of dependencies, but in CPS it is much more complex.
It is common for a chained datasource to require input from multiple upstream sources. It is also
common for those upstream sources to also have multiple inputs. This pattern introduces a data
consistency problem. Take a look at the dataflow diagram below (arrows represent dataflow):

```mermaid
flowchart LR
A --> C
C --> D
B --> C
B --> D
```

In the above layout, `A` and `B` are original sources. `C` listens to both `A` and `B`, and since
they are _original_ sources `C` can produce a new value when either changes. `D` is where it gets
complex. `D` can only produce a value when the version of `C` it holds was produced from the same
version of `B` that `D` currently holds. To solve this consistency issue, CPS versions all dataflow
and then synchronizes around these published versions.
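
The check `D` has to make can be sketched like this. It is an illustration under stated assumptions, not how CPS performs the synchronization: `ConsistencySketch` and `AreConsistent` are hypothetical names, and `string` keys with `long` versions stand in for the real types.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

// Hypothetical model of the check D performs before publishing.
public static class ConsistencySketch
{
    // True when every data source key that appears in both version maps
    // has the same version on both sides.
    public static bool AreConsistent(
        IImmutableDictionary<string, long> first,
        IImmutableDictionary<string, long> second)
    {
        foreach (KeyValuePair<string, long> pair in first)
        {
            if (second.TryGetValue(pair.Key, out long other) && other != pair.Value)
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        // B has just published version 2.
        var latestB = ImmutableDictionary<string, long>.Empty.Add("B", 2);

        // C's last value was still computed from B version 1.
        var staleC = ImmutableDictionary<string, long>.Empty.Add("A", 1).Add("B", 1);

        // C catches up and republishes using B version 2.
        var freshC = staleC.SetItem("B", 2);

        Console.WriteLine(AreConsistent(latestB, staleC)); // False
        Console.WriteLine(AreConsistent(latestB, freshC)); // True
    }
}
```

In this model `D` would hold the new value from `B` and wait, publishing only once `C` delivers a value derived from that same version of `B`.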

## Dataflow Versioning

To solve the problem described above, all dataflow in CPS produces types of `IProjectVersionedValue<T>`.
This type combines `T Value` and `IImmutableDictionary<NamedIdentity, IComparable> DataSourceVersions`.

Then, chained dataflow will carry the versions of its upstream data sources. When a chained source has
multiple upstream sources, its published version becomes the merged value of its upstream sources.
This functionality is facilitated via `ProjectDataSources.SyncLinkTo`. When using that method to link
to multiple upstream sources, a middle dataflow block is created that only publishes to your block when
all received values are in a consistent state. See [this example](../extensibility/dataflow_example.md#chained-data-source-multiple-sources)
for how to use `SyncLinkTo`.

### Rules to Follow with Versioning

__When you are a...__
* __Original source__ you have your own `DataSourceKey` and `DataSourceVersion`. The key
  identifies who you are, and the version must increment whenever you produce a new value.
  The only entry in your published `DataSourceVersions` is your own.
* __Chained source__ you must merge and carry the versions of all dataflow you are chained
  to in your own `DataSourceVersions`. You very rarely have your own version because your
  version is just the combined versions that you chained to. If you do need your own version,
  consider pulling the part that publishes the original data into its own source.
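
The original-source rule can be illustrated with a small model. `OriginalSourceSketch` is hypothetical and does not derive from any CPS base class. It only shows the version incrementing on every published value, and the published version map containing a single entry. A `string` key and `long` version are assumptions made for brevity, and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical model of an original source; not a CPS type.
public sealed class OriginalSourceSketch
{
    private readonly string _dataSourceKey;
    private long _version;

    public OriginalSourceSketch(string dataSourceKey) => _dataSourceKey = dataSourceKey;

    // Pairs the new value with a version map whose only entry is this
    // source's own key, mapped to a freshly incremented version.
    public (string Value, ImmutableDictionary<string, long> Versions) Publish(string value)
    {
        _version++;
        return (value, ImmutableDictionary<string, long>.Empty.Add(_dataSourceKey, _version));
    }

    public static void Main()
    {
        var source = new OriginalSourceSketch("FileOnDisk");

        var first = source.Publish("initial contents");
        var second = source.Publish("edited contents");

        Console.WriteLine(first.Versions["FileOnDisk"]);  // 1
        Console.WriteLine(second.Versions["FileOnDisk"]); // 2
        Console.WriteLine(second.Versions.Count);         // 1
    }
}
```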

### Allowing Inconsistent Versions

In special cases that require it, it is possible to allow for inconsistent versions in your dataflow.
This is for when you depend on multiple upstream sources where one is drastically slower at producing
values than others, but you want to be able to produce intermediate values while the slow one is still
processing. Unfortunately, there is no CPS base class equivalent to `ProjectValueDataSourceBase` or
`ChainedProjectValueDataSourceBase` for this scenario. You will have to manually link to your upstream
sources and synchronize between multiple sources publishing at once. For calculating the data versions
to publish, use `ProjectDataSources.MergeDataSourceVersions`.

## Further reading

- [Dataflow Examples](../extensibility/dataflow_example.md)
- [Dataflow Sources](../extensibility/dataflow_sources.md)
- [Dataflow Best Practices](../extensibility/dataflow_best_practices.md)
Moved to [Dataflow in CPS](dataflow#dataflow-in-cps).
