Possibility on speeding the calculations up? #19

Open
AquifersBSIM opened this issue Dec 5, 2024 · 4 comments

@AquifersBSIM

AquifersBSIM commented Dec 5, 2024

Hi all, I have a question: is there any way to speed the calculations up? It's awesome that mordred is still maintained!

Why speed calculations up?
I have about 90,000 molecules to calculate chemical descriptors for, and it takes between 4 and 5 hours.

The code

import os
import time

from mordred import Calculator, descriptors


def calculate_3D_function(mols, input_filename):
    calc_3D = Calculator(descriptors, ignore_3D=False)  # Initialise 3D descriptors

    # Calculate descriptors
    print(f"Calculating 3D descriptors for {input_filename}...")

    # Start timer
    start_time = time.time()

    df = calc_3D.pandas(mols)
    print(df.head())  # Display the top rows

    # Create output filename based on input filename
    output_filename = f"{os.path.splitext(input_filename)[0]}_descriptors.csv"
    df.to_csv(output_filename, index=False)  # Save to CSV
    print(f"Descriptors saved to {output_filename}")

    # Elapsed time covers both the descriptor calculation and the CSV write
    elapsed_time = time.time() - start_time
    print(f"Processing completed in {elapsed_time:.2f} seconds.\n")
@JacksonBurns
Owner

There's no easy way to make it faster. It's highly parallelized, so if you have access to a machine with more CPU cores it will speed up, but that doesn't really count.

We could do some profiling to find which descriptors are the slowest to calculate and then try and speed those up, if you are interested!
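A minimal sketch of what such profiling could look like, using only the standard library's cProfile and pstats. The helper name `profile_call` is hypothetical and not part of mordred. Note that cProfile only sees the calling process, so the sketch assumes the descriptor run is kept in-process (for example by passing `nproc=1` to `Calculator.pandas`).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    `profile_call` is a hypothetical helper, not a mordred function.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so expensive call chains surface first.
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Example with a stand-in workload. For mordred, something like
# profile_call(calc.pandas, mols, nproc=1) on a small subset of
# molecules would be the analogous call (an assumption, not tested here).
# result, report = profile_call(sum, range(1_000_000))
# print(report)
```

Running this on a few hundred molecules rather than the full set should be enough to see which functions dominate.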

@AquifersBSIM
Author

Hi @JacksonBurns! Thank you so much for the reply and suggestions. Would it be too much trouble for you to profile which descriptors are the slowest to calculate?

@JacksonBurns
Owner

Sure - I put together this small demo (mordred_profile.json), which you can download, change the extension to .ipynb, and then open as a Jupyter notebook. About half of the execution time is actually spent entering and exiting context managers (this is terrible). This can mostly be attributed to this method:

    @classmethod
    def from_query(cls, mol, require_3D, explicit_hydrogens, kekulizes, id, config):
        if not isinstance(mol, Chem.Mol):
            raise TypeError("{!r} is not rdkit.Chem.Mol instance".format(mol))

        n_frags = len(Chem.GetMolFrags(mol))

        if mol.HasProp("_Name"):
            name = mol.GetProp("_Name")
        else:
            name = Chem.MolToSmiles(Chem.RemoveHs(mol, updateExplicitCount=True))

        mols, coords = {}, {}

        for eh, ke in ((eh, ke) for eh in explicit_hydrogens for ke in kekulizes):
            m = Chem.AddHs(mol) if eh else Chem.RemoveHs(mol, updateExplicitCount=True)

            if ke:
                Chem.Kekulize(m)

            if require_3D:
                try:
                    conf = m.GetConformer(id)
                    if conf.Is3D():
                        coords[eh, ke] = conformer_to_numpy(conf)
                except ValueError:
                    pass

            m.RemoveAllConformers()
            mols[eh, ke] = m

        return cls(mols, coords, n_frags, name, config)

This method is inside mordred/_base/context.py, and it spends a lot of time operating on the input molecules. Perhaps you can find a way to reduce the time spent in it? I think computing the name and all of the RDKit Chem operations are probably expensive.
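One thing the quoted method suggests: when a molecule already has a `_Name` property, the `Chem.MolToSmiles(Chem.RemoveHs(...))` fallback is skipped. A minimal sketch of pre-naming molecules before calling the calculator follows. The helper `ensure_names` and the `mol_<index>` naming scheme are hypothetical, and whether this saves a meaningful amount of time has not been measured here.

```python
def ensure_names(mols, prefix="mol"):
    """Give every molecule a `_Name` property if it lacks one.

    Works on any object exposing HasProp/SetProp, such as rdkit.Chem.Mol.
    Existing names are left untouched. Returns how many molecules were named.
    `ensure_names` is a hypothetical helper, not part of mordred or RDKit.
    """
    named = 0
    for index, mol in enumerate(mols):
        if not mol.HasProp("_Name"):
            # Placeholder name; from_query then takes the GetProp branch.
            mol.SetProp("_Name", f"{prefix}_{index}")
            named += 1
    return named
```

Pass a list rather than a generator so the same molecules can be handed to the calculator afterwards. The placeholder then stands in wherever mordred would otherwise have used the SMILES-derived name.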

@AquifersBSIM
Author

Hi @JacksonBurns, thank you for the demo! I will have a look at this and hopefully come back with good news! I tried using more CPU cores, and it sped things up quite well (roughly 10x), but you're right, it doesn't really count.
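The core-count experiment mentioned above could be repeated in a controlled way with something like the following sketch. It assumes `calc` is a mordred `Calculator` (or anything with a compatible `pandas` method) and relies on the `nproc` and `quiet` keyword arguments of `Calculator.pandas`. The helper name `time_nproc` and the default process counts are hypothetical.

```python
import time


def time_nproc(calc, mols, nproc_values=(1, 2, 4)):
    """Time calc.pandas(mols, nproc=n) for each n and return {n: seconds}.

    `time_nproc` is a hypothetical helper; `calc` is assumed to behave
    like mordred's Calculator.
    """
    timings = {}
    for nproc in nproc_values:
        start = time.perf_counter()
        # quiet=True suppresses the progress bar so it does not skew timing.
        calc.pandas(mols, nproc=nproc, quiet=True)
        timings[nproc] = time.perf_counter() - start
    return timings
```

Timing a subset of a few hundred molecules first gives a rough idea of how well the run scales before committing to all 90,000.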
