Commit 1987946

Merge pull request #159 from alan-turing-institute/docsdocsdocs
Improve AirBnB tutorial
2 parents 3d2abed + fed42b2 commit 1987946

32 files changed: +1212 −1096 lines

.pre-commit-config.yaml

Lines changed: 9 additions & 3 deletions
@@ -49,28 +49,34 @@ repos:
       types: ['python']
       exclude: (?x)(
         tests/examples|
-        tests/workspace
+        tests/workspace|
+        examples
       )
     - id: pylint
       name: Pylint
       entry: poetry run pylint
       language: system
       types: ['python']
+      exclude: (?x)(
+        examples/
+      )
     - id: pydocstyle
       name: pydocstyle
       entry: poetry run pydocstyle
       language: system
       types: ['python']
       exclude: (?x)(
         docs/|
-        tests/
+        tests/|
+        examples/
       )
     - id: mypy
       name: mypy
       entry: poetry run mypy --follow-imports=silent
       language: system
       exclude: (?x)(
         tests/examples|
-        tests/workspace
+        tests/workspace|
+        examples
       )
       types: ['python']
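
The ``exclude`` values above are Python regular expressions in verbose mode: the leading ``(?x)`` makes the regex engine ignore the whitespace and newlines used for layout in the YAML. A small sketch of how pre-commit's path matching behaves under such a pattern (the file paths here are hypothetical examples, not taken from the repository):

```python
import re

# Mirrors the updated exclude for the mypy hook. (?x) enables verbose
# mode, so the alternatives can be listed one per line for readability.
pattern = re.compile(
    r"""(?x)(
        tests/examples|
        tests/workspace|
        examples
    )"""
)

print(bool(pattern.search("examples/airbnb/example_config.yaml")))  # True: file excluded
print(bool(pattern.search("sqlsynthgen/providers.py")))             # False: hook runs
```

Note that the pattern is unanchored, so any path containing one of the alternatives as a substring is excluded.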

CONTRIBUTING.md

Lines changed: 1 addition & 9 deletions
@@ -25,15 +25,6 @@ Please install the following software on your workstation:
    poetry install --all-extras
    ```

-   If Poetry errors when installing PyYaml, you will need to manually specify the Cython version and manually install PyYaml (this is a temporary workaround for a PyYaml v5 conflict with Cython v3, see [here](https://github.com/yaml/pyyaml/issues/601) for full details):
-
-   ```bash
-   poetry run pip install "cython<3"
-   poetry run pip install wheel
-   poetry run pip install --no-build-isolation "pyyaml==5.4.1"
-   poetry install --all-extras
-   ```
-
    *If you don't need to [build the project documentation](#building-documentation-locally), the `--all-extras` option can be omitted.*

 1. Install the git hook scripts. They will run whenever you perform a commit:

@@ -86,6 +77,7 @@ Functional tests require a PostgreSQL service running. Perform the following steps:

    ```bash
    createdb dst
+   PGPASSWORD=password psql --host=localhost --username=postgres --file=tests/examples/dst.dump
    ```

 1. Finally, run the functional tests. You will need the environment variable `REQUIRES_DB` with a value of `1`.
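
One common way to implement such a gate, shown here purely as an illustrative sketch (this is not the project's actual test code), is to skip database-backed tests unless ``REQUIRES_DB`` is set:

```python
import os
import unittest

# Illustrative only: functional tests that need the live PostgreSQL
# service can gate themselves on the REQUIRES_DB environment variable.
@unittest.skipUnless(
    os.environ.get("REQUIRES_DB") == "1",
    "Set REQUIRES_DB=1 (and start PostgreSQL) to run the functional tests.",
)
class FunctionalTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # A real test would connect to the restored `dst` database here.
        self.assertTrue(True)
```

With ``REQUIRES_DB`` unset the class is reported as skipped; with ``REQUIRES_DB=1`` it runs normally.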

docs/source/health_data.rst

Lines changed: 82 additions & 25 deletions
@@ -187,58 +187,115 @@ Here is our config for the person table:
         "ethnicity_source_concept_id",
       ]
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: person_source_value
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: provider_id
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: care_site_id

 ``num_rows_per_pass`` is set to 0, because all rows are generated by the story generator.
 Let's use the gender columns as an example.
-Here are the relevant functions from ``row_generators.py``.
+Here is the relevant function from ``row_generators.py``.

 .. code-block:: python

-    def sample_from_sql_group_by(
-        group_by_result, weights_column, value_columns=None, filter_dict=None
-    ):
+    def gender(generic: Generic, src_stats: SrcStats) -> GenderRows:
+        """Generate values for the four gender columns of the OMOP schema.
+
+        Samples from the src_stats["count_gender"] result.
+        """
+        return cast(
+            GenderRows,
+            generic.sql_group_by_provider.sample(
+                src_stats["count_gender"],
+                "num",
+                value_columns=[
+                    "gender_concept_id",
+                    "gender_source_value",
+                    "gender_source_concept_id",
+                ],
+            ),
+        )
+
+Clearly this just off-loads all the work onto ``generic.sql_group_by_provider.sample``, which is a built-in provided by SSG.
+Let's take a look at its source code from ``providers.py``:
+
+.. code-block:: python
+
+    def sample(
+        self,
+        group_by_result: list[dict[str, Any]],
+        weights_column: str,
+        value_columns: Optional[Union[str, list[str]]] = None,
+        filter_dict: Optional[dict[str, Any]] = None,
+    ) -> Union[Any, dict[str, Any], tuple[Any, ...]]:
+        """Random sample a row from the result of a SQL `GROUP BY` query.
+
+        The result of the query is assumed to be in the format that sqlsynthgen's
+        make-stats outputs.
+
+        For example, if one executes the following src-stats query
+        ```
+        SELECT COUNT(*) AS num, nationality, gender, age
+        FROM person
+        GROUP BY nationality, gender, age
+        ```
+        and calls it the `count_demographics` query, one can then use
+        ```
+        generic.sql_group_by_provider.sample(
+            SRC_STATS["count_demographics"],
+            weights_column="num",
+            value_columns=["gender", "nationality"],
+            filter_dict={"age": 23},
+        )
+        ```
+        to restrict the results of the query to only people aged 23, and random sample a
+        pair of `gender` and `nationality` values (returned as a tuple in that order),
+        with the weights of the sampling given by the counts `num`.
+
+        Arguments:
+            group_by_result: Result of the query. A list of rows, with each row being a
+                dictionary with names of columns as keys.
+            weights_column: Name of the column which holds the weights based on which to
+                sample. Typically the result of a `COUNT(*)`.
+            value_columns: Name(s) of the column(s) to include in the result. Either a
+                string for a single column, an iterable of strings for multiple
+                columns, or `None` for all columns (default).
+            filter_dict: Dictionary of `{name_of_column: value_it_must_have}`, to
+                restrict the sampling to a subset of `group_by_result`. Optional.
+
+        Returns:
+            * a single value if `value_columns` is a single column name,
+            * a tuple of values in the same order as `value_columns` if `value_columns`
+              is an iterable of strings.
+            * a dictionary of {name_of_column: value} if `value_columns` is `None`
+        """
         if filter_dict is not None:

-            def filter_func(row):
-                for k, v in filter_dict.items():
-                    if row[k] != v:
+            def filter_func(row: dict) -> bool:
+                for key, value in filter_dict.items():
+                    if row[key] != value:
                         return False
                 return True

             group_by_result = [row for row in group_by_result if filter_func(row)]
             if not group_by_result:
                 raise ValueError("No group_by_result left after filter")

-        weights = [row[weights_column] for row in group_by_result]
+        weights = [cast(int, row[weights_column]) for row in group_by_result]
         weights = [w if w >= 0 else 1 for w in weights]
         random_choice = random.choices(group_by_result, weights)[0]
         if isinstance(value_columns, str):
             return random_choice[value_columns]
-        elif value_columns is not None:
+        if value_columns is not None:
             values = tuple(random_choice[col] for col in value_columns)
             return values
         return random_choice

-    def gender(generic, src_stats):
-        return sample_from_sql_group_by(
-            src_stats["count_gender"],
-            "num",
-            value_columns=[
-                "gender_concept_id",
-                "gender_source_value",
-                "gender_source_concept_id",
-            ],
-        )
-
-``sample_from_sql_group_by`` is a function we use a lot in this config, and in many others.
+The docstring explains the function quite well, but to reiterate:
 Its purpose is to take the output of a src-stats query that does a ``GROUP BY`` by some column(s) and a ``COUNT``, and sample a row from the results, with the sampling weights given by the counts.
-In this case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
+In our case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
 Note that this would also automatically replicate features such as ``NULL`` values or mismatches between the three gender columns, if they exist in the source data.
 The relevant source stats query is defined in this part of the config:

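Stripping away the filtering and type annotations, the heart of ``sample`` is a single ``random.choices`` call weighted by the ``COUNT`` column. A minimal sketch, using made-up rows in place of a real ``SRC_STATS["count_gender"]`` result:

```python
import random

# Stand-in for what a "SELECT COUNT(*) AS num, ... GROUP BY ..." src-stats
# query returns: one dict per group, counts in the "num" column.
# The concept IDs and counts below are illustrative, not real source data.
group_by_result = [
    {"num": 70, "gender_concept_id": 8507, "gender_source_value": "M"},
    {"num": 30, "gender_concept_id": 8532, "gender_source_value": "F"},
]

# Weight each row by its count, so the synthetic data reproduces the
# source distribution (here roughly 70/30).
weights = [row["num"] for row in group_by_result]
row = random.choices(group_by_result, weights)[0]

# With value_columns given as a list, sample() returns a tuple in that order.
values = tuple(row[col] for col in ["gender_concept_id", "gender_source_value"])
print(values)
```

Because whole rows are sampled, correlated columns (here the three gender columns) always stay consistent with one another, which is what lets the generator replicate ``NULL``s and cross-column mismatches from the source data.
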
docs/source/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
.. _page-index:
22

3-
sqlsynthgen's Documentation
3+
sqlsynthgen
44
---------------------------
55

66
**sqlsynthgen** is a package for making copies of relational databases and populating them with random data.

docs/source/installation.rst

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,6 @@
33
Installation
44
============
55

6-
End User
7-
--------
8-
96
To use SqlSynthGen, first install it:
107

118
.. code-block:: console
