Ease the adding of new data #93

Open
konstantinstadler opened this issue Aug 3, 2021 · 3 comments
@konstantinstadler
Member

Currently, new data (country mappings) can only be passed as a DataFrame containing at least the fields name_short, name_official, and regex.
See for example

https://gist.github.com/konstantinstadler/a8c1a651aeda5c67c4910325b8a9b466

Add functionality to pass, for example, just an ISO2-to-ISO3 mapping directly to the convert function (as a dict).

@jm-rivera
Contributor

@konstantinstadler I wrote a wrapper around the .convert method for this exact purpose. It works very similarly to the .pandas_convert method I added last year, but with the option to map additional data. Here's the basic implementation. I'd be happy to contribute it to the project if helpful.

import country_converter as coco
import pandas as pd
from numpy import nan


def convert_id(
    series: pd.Series,
    from_type: str = "regex",
    to_type: str = "ISO3",
    not_found: str | None = None,
    *,
    additional_mapping: dict | None = None,
) -> pd.Series:
    """Takes a pandas Series of country IDs and converts them to the desired type.

    Args:
        series: the Pandas series to convert
        from_type: the classification type according to which the series is encoded.
            Available types come from the country_converter package
            (https://github.com/konstantinstadler/country_converter#classification-schemes)
            For example: ISO3, ISO2, name_short, DACcode, etc.
        to_type: the target classification type. Same options as from_type
        not_found: what to do if the value is not found. Can pass a string or None.
            If None, the original value is passed through.
        additional_mapping: Optionally, a dictionary with additional mappings can be used.
            The keys are the values to be converted and the values are the converted values.
            The keys follow the same datatype as the original values. The values must follow
            the same datatype as the target type.
    """

    # if from and to are the same, return without changing anything
    if from_type == to_type:
        return series

    # Create convert object
    cc = coco.CountryConverter()

    # Get the unique values for mapping. This is done in order to significantly improve
    # the performance of country_converter with very long datasets.
    s_unique = series.unique()

    # Create a correspondence dictionary
    mapping = pd.Series(
        cc.convert(names=s_unique, src=from_type, to=to_type, not_found=nan),
        index=s_unique,
    ).to_dict()

    # If additional_mapping is passed, add to the mapping
    if additional_mapping is not None:
        mapping = mapping | additional_mapping

    return series.map(mapping).fillna(series if not_found is None else not_found)

Supporting this in the main .convert method would require reworking the class a little. I'm also happy to contribute that, if helpful. Let me know!
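
For readers without pandas or country_converter at hand, the core pattern of the wrapper (convert each distinct value once, let the additional mapping override, then fall back) can be sketched with the standard library alone. This is only an illustration: `convert_values` is a made-up name, and `ISO2_TO_ISO3` is a hypothetical two-entry stand-in for what `cc.convert` would return, not real package data.

```python
def convert_values(values, base_lookup, not_found=None, additional_mapping=None):
    """Convert each distinct value once, apply overrides, then fall back."""
    # Convert only the distinct values, preserving first-seen order.
    unique = list(dict.fromkeys(values))
    mapping = {v: base_lookup(v) for v in unique}
    # Drop values the base lookup could not resolve (signalled here by None).
    mapping = {k: v for k, v in mapping.items() if v is not None}
    # User-supplied entries take precedence over the base conversion.
    if additional_mapping is not None:
        mapping = {**mapping, **additional_mapping}
    # Unmatched values pass through unchanged unless not_found is given.
    return [mapping.get(v, v if not_found is None else not_found) for v in values]


# Hypothetical stand-in for cc.convert(src="ISO2", to="ISO3").
ISO2_TO_ISO3 = {"DE": "DEU", "FR": "FRA"}

result = convert_values(
    ["DE", "FR", "XK", "DE", "ZZ"],
    ISO2_TO_ISO3.get,
    additional_mapping={"XK": "XKX"},  # illustrative custom entry
)
print(result)
```

Here "XK" is resolved through the additional mapping and "ZZ" is passed through unchanged because `not_found` is None.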

@jm-rivera
Contributor

@konstantinstadler just bumping this here if you think it would be helpful to implement.

@konstantinstadler
Member Author

Amazing!

(Thanks for the reminder/bump; the first 3 months of 2023 were a bit crazy.)

To keep it simple, we could have a separate function that takes explicit "from" and "to" arguments, as in your proposal.
Adding it to the normal "convert" could cause problems because of the "smart" logic that tries to guess the input format.

But this could potentially be added to pandas_convert, what do you think?
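
The concern about format guessing can be illustrated with a toy example. The guesser below is hypothetical and deliberately simplified; it is not country_converter's actual implementation. It only shows why keys of a custom mapping could be misclassified when no explicit "from" is given.

```python
def guess_format(name: str) -> str:
    """Hypothetical guesser: classify an input by its shape alone."""
    if name.isdigit():
        return "ISOnumeric"
    if len(name) == 2:
        return "ISO2"
    if len(name) == 3:
        return "ISO3"
    return "regex"


# Illustrative custom keys that are not ISO codes but look like them.
custom_mapping = {"EU": "European Union", "ROW": "Rest of World"}
guesses = {key: guess_format(key) for key in custom_mapping}
print(guesses)
```

With a shape-based guess like this, "EU" and "ROW" would be treated as ISO2 and ISO3 codes, which is why a function with explicit "from" and "to" arguments is the safer place for a user-supplied dict.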
