Skip to content

Commit

Permalink
Merge pull request #280 from poseidon-framework/isLatestOptimization
Browse files Browse the repository at this point in the history
V 1.4.0.3: Optimization of resolveEntityIndices
  • Loading branch information
nevrome authored Oct 27, 2023
2 parents 215e010 + 20acc01 commit f134faf
Show file tree
Hide file tree
Showing 17 changed files with 102 additions and 77 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/Dockerfile.centos
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,9 @@ RUN yum -y update
RUN yum -y install zlib-devel wget ncurses-devel ncurses-compat-libs make gcc

# Install GHC since stack's local install has issues
RUN wget https://downloads.haskell.org/~ghc/9.2.7/ghc-9.2.7-x86_64-centos7-linux.tar.xz
RUN tar xvf ghc-9.2.7-x86_64-centos7-linux.tar.xz
RUN cd ghc-9.2.7; ./configure; make install
RUN wget https://downloads.haskell.org/~ghc/9.4.7/ghc-9.4.7-x86_64-centos7-linux.tar.xz
RUN tar xvf ghc-9.4.7-x86_64-centos7-linux.tar.xz
RUN cd ghc-9.4.7; ./configure; make install

# install stack
RUN curl -sSL https://get.haskellstack.org/ | sh
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ jobs:
strategy:
matrix:
stack: ["latest"]
ghc: ["9.2.7"]
ghc: ["9.4.7"]

steps:
# setup and loading cache
Expand Down
6 changes: 3 additions & 3 deletions .github/workflows/release.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ jobs:
matrix:
os: [ubuntu-20.04]
cabal: ["3.6"]
ghc: ["9.2.7"]
ghc: ["9.4.7"]

steps:
- name: Check out code
Expand Down Expand Up @@ -91,7 +91,7 @@ jobs:
matrix:
os: [macOS-latest]
cabal: ["3.6"]
ghc: ["9.2.7"]
ghc: ["9.4.7"]

steps:
- name: Check out code
Expand Down Expand Up @@ -156,7 +156,7 @@ jobs:
matrix:
os: [windows-latest]
cabal: ["3.6"]
ghc: ["9.2.7"]
ghc: ["9.4.7"]

steps:
- name: Check out code
Expand Down
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
- V 1.4.0.3:
- Fixed a severe performance leak in code around `resolveEntityIndices`, which was called in various functions and wastefully recomputed `isLatestInCollection` way too often. This affected simple commands, like fetching a few packages from the server, forging, and has effects also in xerxes.
- Bumped to a newer Compiler (GHC 9.4.7) and new Stackage Snapshot (LTS-21.17)
- V 1.4.0.2:
- Strictly checking ploidy information across the .janno file and the genotype data in the package reading process has unforeseen consequences. Activating this will require some more changes, so we decided to uncomment this code for now.
- V 1.4.0.1:
Expand Down
6 changes: 6 additions & 0 deletions CHANGELOGRELEASE.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
### V 1.4.0.3

This small release fixes a performance issue related to finding the latest version of all packages. The bug had severe detrimental effects on `forge` and `fetch`, which are now resolved.

We used this opportunity to switch to a new GHC version and new versions of a lot of dependencies for building trident.

### V 1.4.0.2

This release finally fully enables handling multiple Poseidon package versions with trident. It includes a significant overhaul of the selection language in `forge` and `fetch` with major changes in its implementation and, as a consequence, multiple (subtle, but strictly breaking) changes in its semantics.
Expand Down
2 changes: 1 addition & 1 deletion cabal.project
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
packages: ./*.cabal
with-compiler: ghc-9.2.7
with-compiler: ghc-9.4.7
allow-newer: table-layout:base
package pandoc-citeproc
flags: +embed_data_files
2 changes: 1 addition & 1 deletion poseidon-hs.cabal
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
name: poseidon-hs
version: 1.4.0.2
version: 1.4.0.3
synopsis: A package with tools for working with Poseidon Genotype Data
description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals.
license: MIT
Expand Down
9 changes: 5 additions & 4 deletions src/Poseidon/CLI/Fetch.hs
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
module Poseidon.CLI.Fetch where

import Poseidon.EntityTypes (EntityInput, HasNameAndVersion (..),
IndividualInfo (..),
PacNameAndVersion (..), PoseidonEntity,
checkIfAllEntitiesExist,
determineRelevantPackages,
Expand All @@ -18,8 +17,10 @@ import Poseidon.Package (PackageReadOptions (..),
import Poseidon.ServerClient (ApiReturnData (..),
ArchiveEndpoint (..),
ExtendedIndividualInfo (..),
PackageInfo (..), processApiResponse,
qDefault, qPacVersion, (+&+))
PackageInfo (..),
extIndInfo2IndInfoCollection,
processApiResponse, qDefault,
qPacVersion, (+&+))
import Poseidon.Utils (LogA, PoseidonException (..),
PoseidonIO, envLogAction, logDebug,
logInfo, logWithEnv, padLeft)
Expand Down Expand Up @@ -81,7 +82,7 @@ runFetch (FetchOptions baseDirs entityInputs archiveE@(ArchiveEndpoint remoteURL
remoteIndList <- do
r <- processApiResponse (remoteURL ++ "/individuals" ++ qDefault archive) False
case r of
ApiReturnExtIndividualInfo indInfo -> return [IndividualInfo n g p | ExtendedIndividualInfo n g p _ _ <- indInfo]
ApiReturnExtIndividualInfo extIndInfos -> return $ extIndInfo2IndInfoCollection extIndInfos
_ -> error "should not happen"


Expand Down
6 changes: 3 additions & 3 deletions src/Poseidon/CLI/Forge.hs
Original file line number Diff line number Diff line change
Expand Up @@ -121,14 +121,14 @@ runForge (
if length entities > 10 then " and " ++ show (length entities - 10) ++ " more" else ""

-- check if all entities can be found. This function reports an error and throws and exception
checkIfAllEntitiesExist entities (getJointIndividualInfo allPackages)
getJointIndividualInfo allPackages >>= checkIfAllEntitiesExist entities
-- determine relevant packages
relevantPackages <- filterToRelevantPackages entities allPackages
logInfo $ (show . length $ relevantPackages) ++ " packages contain data for this forging operation"
when (null relevantPackages) $ liftIO $ throwIO PoseidonEmptyForgeException

-- get all individuals from the relevant packages
let allInds = getJointIndividualInfo relevantPackages
indInfoCollection <- getJointIndividualInfo relevantPackages

-- set entities to only packages, if --packagewise is set
let relevantEntities =
Expand All @@ -137,7 +137,7 @@ runForge (
else entities

-- determine indizes of relevant individuals
relevantIndices <- resolveUniqueEntityIndices relevantEntities allInds
relevantIndices <- resolveUniqueEntityIndices relevantEntities indInfoCollection

-- collect data --
-- janno
Expand Down
1 change: 1 addition & 0 deletions src/Poseidon/CLI/Survey.hs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeOperators #-}

module Poseidon.CLI.Survey where

Expand Down
4 changes: 2 additions & 2 deletions src/Poseidon/CLI/Validate.hs
Original file line number Diff line number Diff line change
Expand Up @@ -67,8 +67,8 @@ runValidate (ValidateOptions
(allPackages, packagesSkipped) <- readPoseidonPackageCollectionWithSkipIndicator pacReadOpts baseDirs
-- stop on duplicates
unless ignoreDup $ do
let allInds = getJointIndividualInfo allPackages
duplicateGroups = filter ((>1) . length)
(allInds, _) <- getJointIndividualInfo allPackages
let duplicateGroups = filter ((>1) . length)
. groupBy (\a b -> indInfoName a == indInfoName b)
. sortOn indInfoName $ allInds
unless (null duplicateGroups) $ do
Expand Down
34 changes: 18 additions & 16 deletions src/Poseidon/EntityTypes.hs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@

module Poseidon.EntityTypes (
IndividualInfo (..),
IndividualInfoCollection,
renderNameWithVersion,
HasNameAndVersion (..),
PoseidonEntity(..),
Expand Down Expand Up @@ -242,15 +243,17 @@ instance HasNameAndVersion IndividualInfo where
getPacName = getPacName . indInfoPac
getPacVersion = getPacVersion . indInfoPac

-- | a tuple of a collection of IndividualInfos and a list of bools
-- indicating whether the given sample is in the latest version of packages
type IndividualInfoCollection = ([IndividualInfo], [Bool])

-- data types for the selection process

data EntityInput a = EntitiesDirect [a] | EntitiesFromFile FilePath -- an empty list is interpreted as "all packages"

-- | determine all packages with versions that contain individuals covered by the given entities
determineRelevantPackages :: (MonadThrow m, EntitySpec a) => [a] -> [IndividualInfo] -> m [PacNameAndVersion]
determineRelevantPackages entities indInfos = do
areLatest <- mapM (isLatestInCollection indInfos) indInfos
determineRelevantPackages :: (MonadThrow m, EntitySpec a) => [a] -> IndividualInfoCollection -> m [PacNameAndVersion]
determineRelevantPackages entities (indInfos, areLatest) = do
let relevantPacs = [ indInfoPac ind | (ind, l) <- zip indInfos areLatest, indInfoConformsToEntitySpecs ind l entities ]
return . nub . map makePacNameAndVersion $ relevantPacs

Expand All @@ -261,16 +264,15 @@ reportDuplicateIndividuals individuals = do -- loop over duplication groups
return (firstInd, [SpecificInd n' (head g) p | IndividualInfo n' g p <- duplicateGroup])

-- | this finds the indices of all individuals from an individual-list which are specified in the Entity list
resolveEntityIndices :: (MonadThrow m, EntitySpec a) => [a] -> [IndividualInfo] -> m [Int]
resolveEntityIndices entities indInfos = do
areLatest <- mapM (isLatestInCollection indInfos) indInfos
resolveEntityIndices :: (MonadThrow m, EntitySpec a) => [a] -> IndividualInfoCollection -> m [Int]
resolveEntityIndices entities (indInfos, areLatest) = do
let relevantIndizes = [ i | (i, ind, l) <- zip3 [0..] indInfos areLatest, indInfoConformsToEntitySpecs ind l entities ]
return relevantIndizes

resolveUniqueEntityIndices :: (EntitySpec a) => [a] -> [IndividualInfo] -> PoseidonIO [Int]
resolveUniqueEntityIndices entities indInfos = do
relevantIndices <- resolveEntityIndices entities indInfos
let duplicateReport = reportDuplicateIndividuals . map (indInfos !!) $ relevantIndices
resolveUniqueEntityIndices :: (EntitySpec a) => [a] -> IndividualInfoCollection -> PoseidonIO [Int]
resolveUniqueEntityIndices entities indInfoCollection = do
relevantIndices <- resolveEntityIndices entities indInfoCollection
let duplicateReport = reportDuplicateIndividuals . map ((fst indInfoCollection) !!) $ relevantIndices
-- check if there still are duplicates and if yes, then stop
unless (null duplicateReport) $ do
logError "There are duplicated individuals, but forge does not allow that"
Expand All @@ -283,13 +285,13 @@ resolveUniqueEntityIndices entities indInfos = do
return relevantIndices

-- | this returns a list of entities which could not be found
determineNonExistentEntities :: (MonadThrow m, EntitySpec a) => [a] -> [IndividualInfo] -> m EntitiesList
determineNonExistentEntities entities indInfos = do
return [ entity | entity <- map underlyingEntity entities, indices <- resolveEntityIndices [entity] indInfos, null indices]
determineNonExistentEntities :: (MonadThrow m, EntitySpec a) => [a] -> IndividualInfoCollection -> m EntitiesList
determineNonExistentEntities entities indInfoCollection = do
return [ entity | entity <- map underlyingEntity entities, indices <- resolveEntityIndices [entity] indInfoCollection, null indices]

checkIfAllEntitiesExist :: (EntitySpec a) => [a] -> [IndividualInfo] -> PoseidonIO ()
checkIfAllEntitiesExist entities indInfos = do
nonExistentEntities <- determineNonExistentEntities entities indInfos
checkIfAllEntitiesExist :: (EntitySpec a) => [a] -> IndividualInfoCollection -> PoseidonIO ()
checkIfAllEntitiesExist entities indInfoCollection = do
nonExistentEntities <- determineNonExistentEntities entities indInfoCollection
unless (null nonExistentEntities) $ do
logError "The following entities could not be found in the dataset"
forM_ nonExistentEntities (logError . show)
Expand Down
22 changes: 14 additions & 8 deletions src/Poseidon/Package.hs
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ import Poseidon.BibFile (BibEntry (..), BibTeX,
import Poseidon.Contributor (ContributorSpec (..), ORCID (..))
import Poseidon.EntityTypes (EntitySpec, HasNameAndVersion (..),
IndividualInfo (..),
IndividualInfoCollection,
PacNameAndVersion (..),
determineRelevantPackages,
isLatestInCollection,
Expand Down Expand Up @@ -805,14 +806,18 @@ getAllGroupInfo packages = do
isLatest <- isLatestInCollection (map makePacNameAndVersion packages) groupPac
return $ GroupInfo groupName groupPac isLatest groupNrInds

getJointIndividualInfo :: [PoseidonPackage] -> [IndividualInfo]
getJointIndividualInfo :: (MonadThrow m) => [PoseidonPackage] -> m IndividualInfoCollection
getJointIndividualInfo packages = do
pac <- packages
jannoRow <- getJannoRowsFromPac pac
return $ IndividualInfo
(jPoseidonID jannoRow)
((getJannoList . jGroupName) jannoRow)
(makePacNameAndVersion pac)
indInfoLatestPairs <- forM packages $ \pac -> do
isLatest <- isLatestInCollection packages pac
forM (getJannoRowsFromPac pac) $ \jannoRow -> do
let indInfo = IndividualInfo
(jPoseidonID jannoRow)
((getJannoList . jGroupName) jannoRow)
(makePacNameAndVersion pac)
return (indInfo, isLatest)
return (map fst . concat $ indInfoLatestPairs, map snd . concat $ indInfoLatestPairs)


getExtendedIndividualInfo :: (MonadThrow m) => [PoseidonPackage] -> [String] -> m [ExtendedIndividualInfo]
getExtendedIndividualInfo allPackages additionalJannoColumns = sequence $ do -- list monad
Expand All @@ -830,5 +835,6 @@ getExtendedIndividualInfo allPackages additionalJannoColumns = sequence $ do --
-- | Filter packages such that only packages with individuals covered by the given EntitySpec are returned
filterToRelevantPackages :: (MonadThrow m) => (EntitySpec a) => [a] -> [PoseidonPackage] -> m [PoseidonPackage]
filterToRelevantPackages entities packages = do
relevantPacs <- determineRelevantPackages entities (getJointIndividualInfo packages)
indInfoCollection <- getJointIndividualInfo packages
relevantPacs <- determineRelevantPackages entities indInfoCollection
return $ filter (\p -> makePacNameAndVersion p `elem` relevantPacs) packages
9 changes: 9 additions & 0 deletions src/Poseidon/ServerClient.hs
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,14 @@ module Poseidon.ServerClient (
processApiResponse,
ArchiveEndpoint(..),
PackageInfo (..), GroupInfo (..), ExtendedIndividualInfo(..),
extIndInfo2IndInfoCollection,
qDefault, qArchive, qPacVersion, (+&+)
) where

import Paths_poseidon_hs (version)
import Poseidon.EntityTypes (HasNameAndVersion (..),
IndividualInfo (..),
IndividualInfoCollection,
PacNameAndVersion (..))
import Poseidon.Janno (JannoRows)
import Poseidon.Utils (PoseidonException (..), PoseidonIO,
Expand Down Expand Up @@ -216,3 +219,9 @@ processApiResponse url quiet = do
Nothing -> do
logError "The server request was unsuccessful"
liftIO . throwIO . PoseidonServerCommunicationException $ "Server error upon URL " ++ url

extIndInfo2IndInfoCollection :: [ExtendedIndividualInfo] -> IndividualInfoCollection
extIndInfo2IndInfoCollection extIndInfos =
let indInfos = [IndividualInfo n g p | ExtendedIndividualInfo n g p _ _ <- extIndInfos]
areLatest = map extIndInfoIsLatest extIndInfos
in (indInfos, areLatest)
5 changes: 2 additions & 3 deletions stack.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
resolver: lts-20.24
resolver: lts-21.17

packages:
- .
Expand All @@ -7,8 +7,7 @@ extra-deps:
- data-default-instances-base-0.1.0.1
- sequence-formats-1.7.0
- pipes-ordered-zip-1.2.1
- co-log-0.5.0.0
- co-log-0.6.0.2
- ansi-terminal-0.10.3
- chronos-1.1.5
- typerep-map-0.5.0.0
allow-newer: true
23 changes: 8 additions & 15 deletions stack.yaml.lock
Original file line number Diff line number Diff line change
Expand Up @@ -33,12 +33,12 @@ packages:
original:
hackage: pipes-ordered-zip-1.2.1
- completed:
hackage: co-log-0.5.0.0@sha256:a7e84650eaef7eba2d59ee7664309e79317a7ca67011abedf971f0e6bd6475bb,5448
hackage: co-log-0.6.0.2@sha256:91c14447cb1cbdd6d76317e80d2acd75efe4d964975a57f9731d8af56a4fba7a,6395
pantry-tree:
sha256: 33b838c07c8b7e70b2e82bddc889bb1e6386d7e12a9d1593c0b4b263b1fcb925
size: 1043
sha256: 2ed9bcc839c3681796dd7ef797bcd65cf73e20fc0c542c94cae0a2532db7b3ba
size: 1198
original:
hackage: co-log-0.5.0.0
hackage: co-log-0.6.0.2
- completed:
hackage: ansi-terminal-0.10.3@sha256:e2fbcef5f980dc234c7ad8e2fa433b0e8109132c9e643bc40ea5608cd5697797,3226
pantry-tree:
Expand All @@ -53,16 +53,9 @@ packages:
size: 581
original:
hackage: chronos-1.1.5
- completed:
hackage: typerep-map-0.5.0.0@sha256:34f1ba9b268a6d52e26ae460011a5571e8099b50a3f4a7c8db25dd8efe3be8ee,4667
pantry-tree:
sha256: ca5565de307d260dc67f6dae0d4d33eee42a3238183461569b5142ceb909c91d
size: 1487
original:
hackage: typerep-map-0.5.0.0
snapshots:
- completed:
sha256: e019cd29e3f7f9dbad500225829a3f7a50f73c674614f2f452e21bb8bf5d99ea
size: 650253
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/20/24.yaml
original: lts-20.24
sha256: 85d2382958c178491d3fe50d770a624621f5ab456beef7d31ac7521f780c9bc7
size: 640042
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/21/17.yaml
original: lts-21.17
Loading

0 comments on commit f134faf

Please sign in to comment.