CONTRIBUTING.md (1 addition & 9 deletions)
@@ -25,15 +25,6 @@ Please install the following software on your workstation:
 poetry install --all-extras
 ```
 
-If Poetry errors when installing PyYaml, you will need to manually specify the Cython version and manually install PyYaml (this is a temporary workaround for a PyYaml v5 conflict with Cython v3, see [here](https://github.com/yaml/pyyaml/issues/601) for full details):
-
-```bash
-poetry run pip install "cython<3"
-poetry run pip install wheel
-poetry run pip install --no-build-isolation "pyyaml==5.4.1"
-poetry install --all-extras
-```
-
 *If you don't need to [build the project documentation](#building-documentation-locally), the `--all-extras` option can be omitted.*
 
 1. Install the git hook scripts. They will run whenever you perform a commit:
@@ -86,6 +77,7 @@ Functional tests require a PostgreSQL service running. Perform the following ste
     values = tuple(random_choice[col] for col in value_columns)
     return values
 return random_choice
 
-def gender(generic, src_stats):
-    return sample_from_sql_group_by(
-        src_stats["count_gender"],
-        "num",
-        value_columns=[
-            "gender_concept_id",
-            "gender_source_value",
-            "gender_source_concept_id",
-        ],
-    )
-
-``sample_from_sql_group_by`` is a function we use a lot in this config, and in many others.
+The docstring explains the function quite well, but to reiterate:
 Its purpose is to take the output of a src-stats query that does a ``GROUP BY`` by some column(s) and a ``COUNT``, and sample a row from the results, with the sampling weights given by the counts.
-In this case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
+In our case we've done a ``GROUP BY`` over the three columns relating to gender, and thus are sampling from the same distribution of genders as in the source data, when creating our synthetic data.
 Note that this would also automatically replicate features such as ``NULL`` values or mismatches between the three gender columns, if they exist in the source data.
 The relevant source stats query is defined in this part of the config:
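As an aside on the sampling behaviour described in the hunk above (pick one row from a `GROUP BY`/`COUNT` result, using the counts as weights), the idea can be sketched in plain Python. This is a hypothetical illustration, not the project's `sample_from_sql_group_by`: the name `sample_weighted_row`, the `rng` parameter, and the example rows are all assumptions made up for this sketch, and only the argument order mirrors the call shown in the diff.

```python
import random
from typing import Any, Mapping, Optional, Sequence


def sample_weighted_row(
    group_by_result: Sequence[Mapping[str, Any]],
    weights_column: str,
    value_columns: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
):
    """Hypothetical sketch: pick one row, weighted by its count column."""
    rng = rng or random.Random()
    weights = [row[weights_column] for row in group_by_result]
    # random.choices draws with replacement; k=1 gives one weighted pick.
    chosen = rng.choices(group_by_result, weights=weights, k=1)[0]
    if value_columns is not None:
        # Return only the requested columns, in the requested order.
        return tuple(chosen[col] for col in value_columns)
    return chosen


# Invented rows, shaped like the result of a query that groups by the three
# gender columns and counts rows as "num". The values are not real data.
count_gender = [
    {"gender_concept_id": 1, "gender_source_value": "A",
     "gender_source_concept_id": 0, "num": 60},
    {"gender_concept_id": 2, "gender_source_value": "B",
     "gender_source_concept_id": 0, "num": 38},
    # A NULL-like group is sampled too, in proportion to its count.
    {"gender_concept_id": None, "gender_source_value": None,
     "gender_source_concept_id": None, "num": 2},
]

gender_values = sample_weighted_row(
    count_gender,
    "num",
    value_columns=[
        "gender_concept_id",
        "gender_source_value",
        "gender_source_concept_id",
    ],
)
```

Because whole rows are sampled rather than each column independently, any combination of values across the three columns (including `NULL`s or mismatches) appears in the synthetic data only as often as it does in the grouped counts.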