Commit 1987946

Merge pull request #159 from alan-turing-institute/docsdocsdocs
Improve AirBnB tutorial
2 parents 3d2abed + fed42b2 commit 1987946

32 files changed: +1212 −1096 lines

.pre-commit-config.yaml

Lines changed: 9 additions & 3 deletions
@@ -49,28 +49,34 @@ repos:
       types: ['python']
       exclude: (?x)(
         tests/examples|
-        tests/workspace
+        tests/workspace|
+        examples
       )
     - id: pylint
       name: Pylint
       entry: poetry run pylint
       language: system
       types: ['python']
+      exclude: (?x)(
+        examples/
+      )
     - id: pydocstyle
       name: pydocstyle
       entry: poetry run pydocstyle
       language: system
       types: ['python']
       exclude: (?x)(
         docs/|
-        tests/
+        tests/|
+        examples/
       )
     - id: mypy
       name: mypy
       entry: poetry run mypy --follow-imports=silent
       language: system
       exclude: (?x)(
         tests/examples|
-        tests/workspace
+        tests/workspace|
+        examples
       )
       types: ['python']
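
The ``exclude`` values above are Python regular expressions in verbose mode: the leading ``(?x)`` makes the regex engine ignore the whitespace and newlines used for layout in the YAML. A small sketch of how pre-commit's path matching behaves under such a pattern (the file paths here are hypothetical examples, not taken from the repository):

```python
import re

# Mirrors the updated exclude for the mypy hook. (?x) enables verbose
# mode, so the alternatives can be listed one per line for readability.
pattern = re.compile(
    r"""(?x)(
        tests/examples|
        tests/workspace|
        examples
    )"""
)

print(bool(pattern.search("examples/airbnb/example_config.yaml")))  # True: file excluded
print(bool(pattern.search("sqlsynthgen/providers.py")))             # False: hook runs
```

Note that the pattern is unanchored, so any path containing one of the alternatives as a substring is excluded.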

CONTRIBUTING.md

Lines changed: 1 addition & 9 deletions
@@ -25,15 +25,6 @@ Please install the following software on your workstation:
    poetry install --all-extras
    ```

-   If Poetry errors when installing PyYaml, you will need to manually specify the Cython version and manually install PyYaml (this is a temporary workaround for a PyYaml v5 conflict with Cython v3, see [here](https://github.com/yaml/pyyaml/issues/601) for full details):
-
-   ```bash
-   poetry run pip install "cython<3"
-   poetry run pip install wheel
-   poetry run pip install --no-build-isolation "pyyaml==5.4.1"
-   poetry install --all-extras
-   ```
-
    *If you don't need to [build the project documentation](#building-documentation-locally), the `--all-extras` option can be omitted.*

 1. Install the git hook scripts. They will run whenever you perform a commit:

@@ -86,6 +77,7 @@ Functional tests require a PostgreSQL service running. Perform the following steps:

    ```bash
    createdb dst
+   PGPASSWORD=password psql --host=localhost --username=postgres --file=tests/examples/dst.dump
    ```

 1. Finally, run the functional tests. You will need the environment variable `REQUIRES_DB` with a value of `1`.
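
One common way to implement such a gate, shown here purely as an illustrative sketch (this is not the project's actual test code), is to skip database-backed tests unless ``REQUIRES_DB`` is set:

```python
import os
import unittest

# Illustrative only: functional tests that need the live PostgreSQL
# service can gate themselves on the REQUIRES_DB environment variable.
@unittest.skipUnless(
    os.environ.get("REQUIRES_DB") == "1",
    "Set REQUIRES_DB=1 (and start PostgreSQL) to run the functional tests.",
)
class FunctionalTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # A real test would connect to the restored `dst` database here.
        self.assertTrue(True)
```

With ``REQUIRES_DB`` unset the class is reported as skipped; with ``REQUIRES_DB=1`` it runs normally.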

docs/source/health_data.rst

Lines changed: 82 additions & 25 deletions
@@ -187,58 +187,115 @@ Here is our config for the person table:
         "ethnicity_source_concept_id",
       ]
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: person_source_value
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: provider_id
-    - name: row_generators.make_null
+    - name: generic.null_provider.null
       columns_assigned: care_site_id

 ``num_rows_per_pass`` is set to 0, because all rows are generated by the story generator.
 Let's use the gender columns as an example.
-Here are the relevant functions from ``row_generators.py``.
+Here is the relevant function from ``row_generators.py``.

 .. code-block:: python

-    def sample_from_sql_group_by(
-        group_by_result, weights_column, value_columns=None, filter_dict=None
-    ):
+    def gender(generic: Generic, src_stats: SrcStats) -> GenderRows:
+        """Generate values for the four gender columns of the OMOP schema.
+
+        Samples from the src_stats["count_gender"] result.
+        """
+        return cast(
+            GenderRows,
+            generic.sql_group_by_provider.sample(
+                src_stats["count_gender"],
+                "num",
+                value_columns=[
+                    "gender_concept_id",
+                    "gender_source_value",
+                    "gender_source_concept_id",
+                ],
+            ),
+        )
+
+Clearly this just off-loads all the work onto ``generic.sql_group_by_provider.sample``, which is a built-in provided by SSG.
+Let's take a look at its source code from ``providers.py``:
+
+.. code-block:: python
+
+    def sample(
+        self,
+        group_by_result: list[dict[str, Any]],
+        weights_column: str,
+        value_columns: Optional[Union[str, list[str]]] = None,
+        filter_dict: Optional[dict[str, Any]] = None,
+    ) -> Union[Any, dict[str, Any], tuple[Any, ...]]:
+        """Random sample a row from the result of a SQL `GROUP BY` query.
+
+        The result of the query is assumed to be in the format that sqlsynthgen's
+        make-stats outputs.
+
+        For example, if one executes the following src-stats query
+        ```
+        SELECT COUNT(*) AS num, nationality, gender, age
+        FROM person
+        GROUP BY nationality, gender, age
+        ```
+        and calls it the `count_demographics` query, one can then use
+        ```
+        generic.sql_group_by_provider.sample(
+            SRC_STATS["count_demographics"],
+            weights_column="num",
+            value_columns=["gender", "nationality"],
+            filter_dict={"age": 23},
+        )
+        ```
+        to restrict the results of the query to only people aged 23, and random sample a
+        pair of `gender` and `nationality` values (returned as a tuple in that order),
+        with the weights of the sampling given by the counts `num`.
+
+        Arguments:
+            group_by_result: Result of the query. A list of rows, with each row being a
+                dictionary with names of columns as keys.
+            weights_column: Name of the column which holds the weights based on which to
+                sample. Typically the result of a `COUNT(*)`.
+            value_columns: Name(s) of the column(s) to include in the result. Either a
+                string for a single column, an iterable of strings for multiple
+                columns, or `None` for all columns (default).
+            filter_dict: Dictionary of `{name_of_column: value_it_must_have}`, to
+                restrict the sampling to a subset of `group_by_result`. Optional.
+
+        Returns:
+            * a single value if `value_columns` is a single column name,
+            * a tuple of values in the same order as `value_columns` if `value_columns`
+              is an iterable of strings.
+            * a dictionary of {name_of_column: value} if `value_columns` is `None`
+        """
         if filter_dict is not None:

-            def filter_func(row):
-                for k, v in filter_dict.items():
-                    if row[k] != v:
+            def filter_func(row: dict) -> bool:
+                for key, value in filter_dict.items():
+                    if row[key] != value:
                         return False
                 return True

             group_by_result = [row for row in group_by_result if filter_func(row)]
             if not group_by_result:
                 raise ValueError("No group_by_result left after filter")

-        weights = [row[weights_column] for row in group_by_result]
+        weights = [cast(int, row[weights_column]) for row in group_by_result]
         weights = [w if w >= 0 else 1 for w in weights]
         random_choice = random.choices(group_by_result, weights)[0]
         if isinstance(value_columns, str):
             return random_choice[value_columns]
-        elif value_columns is not None:
+        if value_columns is not None:
             values = tuple(random_choice[col] for col in value_columns)
             return values
         return random_choice

-    def gender(generic, src_stats):
-        return sample_from_sql_group_by(
-            src_stats["count_gender"],
-            "num",
-            value_columns=[
-                "gender_concept_id",
-                "gender_source_value",
-                "gender_source_concept_id",
-            ],
-        )
-
-``sample_from_sql_group_by`` is a function we use a lot in this config, and in many others.
+The docstring explains the function quite well, but to reiterate:
 Its purpose is to take the output of a src-stats query that does a ``GROUP BY`` by some column(s) and a ``COUNT``, and sample a row from the results, with the sampling weights given by the counts.
-In this case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
+In our case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
 Note that this would also automatically replicate features such as ``NULL`` values or mismatches between the three gender columns, if they exist in the source data.
 The relevant source stats query is defined in this part of the config:

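Stripping away the filtering and type annotations, the heart of ``sample`` is a single ``random.choices`` call weighted by the ``COUNT`` column. A minimal sketch, using made-up rows in place of a real ``SRC_STATS["count_gender"]`` result:

```python
import random

# Stand-in for what a "SELECT COUNT(*) AS num, ... GROUP BY ..." src-stats
# query returns: one dict per group, counts in the "num" column.
# The concept IDs and counts below are illustrative, not real source data.
group_by_result = [
    {"num": 70, "gender_concept_id": 8507, "gender_source_value": "M"},
    {"num": 30, "gender_concept_id": 8532, "gender_source_value": "F"},
]

# Weight each row by its count, so the synthetic data reproduces the
# source distribution (here roughly 70/30).
weights = [row["num"] for row in group_by_result]
row = random.choices(group_by_result, weights)[0]

# With value_columns given as a list, sample() returns a tuple in that order.
values = tuple(row[col] for col in ["gender_concept_id", "gender_source_value"])
print(values)
```

Because whole rows are sampled, correlated columns (here the three gender columns) always stay consistent with one another, which is what lets the generator replicate ``NULL``s and cross-column mismatches from the source data.
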
docs/source/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
.. _page-index:
22

3-
sqlsynthgen's Documentation
3+
sqlsynthgen
44
---------------------------
55

66
**sqlsynthgen** is a package for making copies of relational databases and populating them with random data.

docs/source/installation.rst

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,6 @@
33
Installation
44
============
55

6-
End User
7-
--------
8-
96
To use SqlSynthGen, first install it:
107

118
.. code-block:: console
