Feature/refactor sources #48

Merged on Jan 23, 2025 (49 commits)
Changes from 37 commits
Commits
9159407
Remove admin and datasets.toml
leo-mazzone Jan 16, 2025
6dd75c3
Rename exceptions to clarify which are server-side
leo-mazzone Jan 16, 2025
93e78d1
Split client and common source objects
leo-mazzone Jan 16, 2025
df55182
Update ORM
leo-mazzone Jan 16, 2025
078b3a6
Generalise get_model to get_resolution
leo-mazzone Jan 16, 2025
5dd1ff8
Start changing server-side query
leo-mazzone Jan 16, 2025
855b188
Generalise get_model in API
leo-mazzone Jan 16, 2025
75b08ab
Start moving codebase to using newer sources objects
leo-mazzone Jan 16, 2025
501f4ad
Solve a few problems with refactoring
leo-mazzone Jan 16, 2025
8693b19
Continue refactoring
leo-mazzone Jan 17, 2025
c371811
Finish working on common sources and continue refactor
leo-mazzone Jan 17, 2025
fd25f4f
Rework selectors and client-side query
leo-mazzone Jan 17, 2025
3d8cd44
Move Selector out of common
leo-mazzone Jan 17, 2025
f48f76b
Update abstract backend adapter
leo-mazzone Jan 17, 2025
f043c64
Make columns optional in Source and hash before serialising
leo-mazzone Jan 17, 2025
88c5bf0
Add get_dataset to backend
leo-mazzone Jan 17, 2025
0a00619
Store column data in plain text
leo-mazzone Jan 20, 2025
9277fef
Finish refactor, prior to testing
leo-mazzone Jan 20, 2025
25b347e
Finish refactor, prior to testing
leo-mazzone Jan 20, 2025
510b450
Update init files
leo-mazzone Jan 20, 2025
95abb16
Improve set_engine usability
leo-mazzone Jan 20, 2025
5ca0978
Fix benchmark query generation
leo-mazzone Jan 20, 2025
7585b7f
Fix some imports and types in tests
leo-mazzone Jan 20, 2025
51fd2d3
Use new client queries in adapter tests
leo-mazzone Jan 20, 2025
7b96f08
Get tests to compile
leo-mazzone Jan 20, 2025
fb9ce16
Make progress towards fixing tests
leo-mazzone Jan 20, 2025
f77a560
Solve various problems uncovered by tests
leo-mazzone Jan 20, 2025
47c069f
Re-introduce query without resolution in some cases
leo-mazzone Jan 20, 2025
5c187f0
Set dataset resolution names
leo-mazzone Jan 21, 2025
acf0393
Implement (terribly) col type on ORM
leo-mazzone Jan 21, 2025
ffd0959
Fix a few incorrect API calls
leo-mazzone Jan 21, 2025
659220b
Continue fixing tests
leo-mazzone Jan 21, 2025
a139713
Make all tests pass
leo-mazzone Jan 21, 2025
9955bdf
Merge branch 'main' into feature/refactor-sources
leo-mazzone Jan 21, 2025
914221d
Automatic reformatting of some docstrings
leo-mazzone Jan 21, 2025
66530f8
Update README, some docstrings and type hints
leo-mazzone Jan 21, 2025
6013b12
Revert using SourceAddress as API input
leo-mazzone Jan 22, 2025
77d3d19
Write some new tests
leo-mazzone Jan 22, 2025
157532a
Address PR comments
leo-mazzone Jan 22, 2025
2d04597
Solve final few problems
leo-mazzone Jan 22, 2025
95edf53
Complete client and common tests
leo-mazzone Jan 23, 2025
ee0e6fa
Complete all tests
leo-mazzone Jan 23, 2025
324e221
Add comments to tests
leo-mazzone Jan 23, 2025
ffa8499
Simplify index tests
leo-mazzone Jan 23, 2025
231db1b
Remove legacy function from top namespace
leo-mazzone Jan 23, 2025
7ce5086
Add docstrings
leo-mazzone Jan 23, 2025
9b7f656
Address final PR comments
leo-mazzone Jan 23, 2025
56b4c55
Update doc actions
leo-mazzone Jan 23, 2025
6fd7a94
Remove temporarily docs workflow
leo-mazzone Jan 23, 2025
53 changes: 0 additions & 53 deletions sample.datasets.toml

This file was deleted.

4 changes: 1 addition & 3 deletions src/matchbox/client/_handler.py
@@ -6,9 +6,7 @@


 def url(path: str) -> str:
-    """
-    Return path prefixed by API root, determined from environment
-    """
+    """Return path prefixed by API root, determined from environment"""
     api_root = getenv("API__ROOT")
     if api_root is None:
         raise RuntimeError("API__ROOT needs to be defined in the environment")
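
Review note: a minimal sketch of the environment check this hunk touches, to make the failure mode concrete. It mirrors only the lines visible in the diff. `require_api_root` is a hypothetical name, the `http://localhost:8000` value is a placeholder, and the way the real `url` joins the root and the path is not shown in this hunk, so it is left out.

```python
from os import environ, getenv


def require_api_root() -> str:
    """Return the API root from the environment, or fail loudly.

    Hypothetical helper: it mirrors only the check visible in the diff.
    """
    api_root = getenv("API__ROOT")
    if api_root is None:
        raise RuntimeError("API__ROOT needs to be defined in the environment")
    return api_root


# With the variable unset, the check raises.
environ.pop("API__ROOT", None)
try:
    require_api_root()
except RuntimeError as error:
    message = str(error)

# With the variable set (placeholder value), the root is returned unchanged.
environ["API__ROOT"] = "http://localhost:8000"
root = require_api_root()
```

Failing at call time rather than at import time means the client can be imported in environments where the API is not configured.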
59 changes: 0 additions & 59 deletions src/matchbox/client/admin.py

This file was deleted.

31 changes: 8 additions & 23 deletions src/matchbox/client/clean/lib.py
@@ -12,8 +12,7 @@ def company_name(
     column_secondary: str = None,
     stopwords: str = cu.STOPWORDS,
 ) -> DataFrame:
-    """
-    Lower case, remove punctuation & tokenise the company name into an array.
+    """Lower case, remove punctuation & tokenise the company name into an array.
     Extract tokens into: 'unusual' and 'stopwords'. Dedupe. Sort alphabetically.
     Untokenise the unusual words back to a string.

@@ -26,7 +25,6 @@ def company_name(
     Returns:
         dataframe: the same as went in, but cleaned
     """
-
     remove_stopwords = partial(steps.remove_stopwords, stopwords=stopwords)

     clean_primary = cu.cleaning_function(
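
Review note: the `company_name` docstring describes a pipeline (lower case, strip punctuation, tokenise, separate stopwords, dedupe, sort, rejoin). The sketch below applies those steps to a single string in plain Python. It is an illustration under assumptions, not the library's implementation: the real function works on a dataframe column through `cu.cleaning_function` and `steps`, and both `clean_company_name` and the stopword list here are hypothetical.

```python
import string

# Hypothetical stopword list; the real default is cu.STOPWORDS.
STOPWORDS = frozenset({"ltd", "limited", "plc", "the"})

# Map every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_company_name(name: str, stopwords: frozenset = STOPWORDS) -> str:
    """Apply the steps listed in the docstring to one company name."""
    # Lower case, remove punctuation, tokenise on whitespace.
    tokens = name.lower().translate(PUNCTUATION_TO_SPACE).split()
    # Keep the 'unusual' tokens, dedupe with a set, sort alphabetically.
    unusual = sorted({token for token in tokens if token not in stopwords})
    # Untokenise back to a string.
    return " ".join(unusual)


cleaned = clean_company_name("The Widget & Gadget Co., Ltd.")
```

Sorting the deduplicated tokens makes the result independent of word order, so "Gadget Widget Co" and "Widget Gadget Co" clean to the same string.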
@@ -48,16 +46,14 @@ def company_name(


 def company_number(df: DataFrame, column: str) -> DataFrame:
-    """
-    Remove non-numbers, and then leading zeroes.
+    """Remove non-numbers, and then leading zeroes.

     Args:
         df: a dataframe
         column: a column containing a company number
     Returns:
         dataframe: the same as went in, but cleaned
     """
-
     clean_number = cu.cleaning_function(steps.remove_notnumbers_leadingzeroes)

     df = clean_number(df, column)
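
Review note: a plain-Python sketch of what "remove non-numbers, and then leading zeroes" means for a single value. `clean_company_number` is a hypothetical stand-in; how `steps.remove_notnumbers_leadingzeroes` is actually implemented is not visible in this diff.

```python
import re


def clean_company_number(value: str) -> str:
    """Drop every non-digit character, then strip leading zeroes."""
    digits_only = re.sub(r"[^0-9]", "", value)
    return digits_only.lstrip("0")


examples = [clean_company_number(v) for v in ("SC 012345", "00-12 34", "000")]
```

The order matters: stripping zeroes first would leave the zero in "SC 012345" in place, because it is not yet at the start of the string. A value made only of zeroes cleans to an empty string.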
@@ -66,8 +62,7 @@ def company_number(df: DataFrame, column: str) -> DataFrame:


 def postcode(df: DataFrame, column: str) -> DataFrame:
-    """
-    Removes all punctuation, converts to upper, removes all spaces.
+    """Removes all punctuation, converts to upper, removes all spaces.

     Args:
         df: a dataframe
@@ -76,7 +71,6 @@ def postcode(df: DataFrame, column: str) -> DataFrame:
         dataframe: the same as went in, but cleaned

     """
-
     clean_postcode = cu.cleaning_function(
         steps.punctuation_to_spaces, steps.to_upper, steps.remove_whitespace
     )
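
Review note: the three steps composed here (`punctuation_to_spaces`, `to_upper`, `remove_whitespace`) can be illustrated on a single string as below. This is a sketch under assumptions: `clean_postcode_value` is hypothetical, and the real steps are applied to a dataframe column by `cu.cleaning_function`, whose internals are not shown in this diff.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_postcode_value(value: str) -> str:
    """Apply the three steps in the same order as the composed cleaner."""
    spaced = value.translate(PUNCTUATION_TO_SPACE)  # punctuation to spaces
    upper = spaced.upper()                          # to upper case
    return "".join(upper.split())                   # remove all whitespace


cleaned_postcodes = [clean_postcode_value(v) for v in ("sw1a-2aa", " ec1a.1bb ")]
```

Turning punctuation into spaces first, and removing whitespace last, means a hyphen or full stop used as a separator cannot survive into the cleaned value.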
@@ -87,16 +81,14 @@ def postcode(df: DataFrame, column: str) -> DataFrame:


 def postcode_to_area(df: DataFrame, column: str) -> DataFrame:
-    """
-    Extracts postcode area from a postcode
+    """Extracts postcode area from a postcode

     Args:
         df: a dataframe
         column: a column containing a postcode
     Returns:
         dataframe: the same as went in, but cleaned
     """
-
     extract_area = cu.cleaning_function(steps.get_postcode_area)

     df = extract_area(df, column)
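
Review note: for readers unfamiliar with the term, a UK postcode area is the one or two letters at the start of the postcode. The sketch below extracts it from a single string. `postcode_area` is a hypothetical helper, and whether `steps.get_postcode_area` uses the same rule is an assumption, since its body is not part of this diff.

```python
import re
from typing import Optional


def postcode_area(postcode: str) -> Optional[str]:
    """Return the leading one or two letters of a postcode, or None."""
    match = re.match(r"[A-Z]{1,2}", postcode.strip().upper())
    return match.group(0) if match else None


areas = [postcode_area(p) for p in ("SW1A 2AA", "m1 1ae", "123")]
```

Returning `None` for values with no leading letters keeps malformed input distinguishable from a valid area.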
@@ -107,8 +99,7 @@ def postcode_to_area(df: DataFrame, column: str) -> DataFrame:
 def extract_company_number_to_new(
     df: DataFrame, column: str, new_column: str
 ) -> DataFrame:
-    """
-    Detects the Companies House CRN in a column and moves it to a new column.
+    """Detects the Companies House CRN in a column and moves it to a new column.

     Args:
         df: a dataframe
@@ -117,7 +108,6 @@ def extract_company_number_to_new(
     Returns:
         dataframe: the same as went in with a new column for CRNs
     """
-
     clean_crn = cu.cleaning_function(
         steps.clean_punctuation_except_hyphens,
         steps.to_upper,
@@ -134,8 +124,7 @@ def extract_company_number_to_new(
 def extract_duns_number_to_new(
     df: DataFrame, column: str, new_column: str
 ) -> DataFrame:
-    """
-    Detects the Dun & Bradstreet DUNS number in a column and moves it to
+    """Detects the Dun & Bradstreet DUNS number in a column and moves it to
     a new column.

     Args:
@@ -145,7 +134,6 @@ def extract_duns_number_to_new(
     Returns:
         dataframe: the same as went in with a new column for DUNS numbers
     """
-
     clean_duns = cu.cleaning_function(
         steps.clean_punctuation_except_hyphens, steps.to_upper, steps.filter_duns_number
     )
@@ -160,8 +148,7 @@ def extract_duns_number_to_new(
 def extract_cdms_number_to_new(
     df: DataFrame, column: str, new_column: str
 ) -> DataFrame:
-    """
-    Detects the CDMS number in a column and moves it to a new column.
+    """Detects the CDMS number in a column and moves it to a new column.

     Args:
         df: a dataframe
@@ -170,7 +157,6 @@ def extract_cdms_number_to_new(
     Returns:
         dataframe: the same as went in with a new column for CDMS numbers
     """
-
     clean_cdms = cu.cleaning_function(
         steps.clean_punctuation_except_hyphens, steps.to_upper, steps.filter_cdms_number
     )
@@ -183,8 +169,7 @@ def extract_cdms_number_to_new(


 def drop(df: DataFrame, column: str) -> DataFrame:
-    """
-    Drops the column from the dataframe.
+    """Drops the column from the dataframe.

     Args:
         df: a dataframe