
Do we still need prentsbyLumi API? #613

Open
yuyiguo opened this issue Jun 24, 2019 · 20 comments

Comments

@yuyiguo
Member

yuyiguo commented Jun 24, 2019

@amaltaro
The listFileParentsByLumi(block_name) API was requested by Seangchan while he was dealing with an agent bug that left some files without parents. We have a unit test for the API. However, it usually cannot finish within 5 minutes because the blocks and the database are much bigger than they were when the API was created. Since this API was only created for Seangchan's recovery work, I'd like to disable it if there is no more recovery to do.

@amaltaro
Contributor

amaltaro commented Jul 1, 2019

@yuyiguo I don't see listFileParentsByLumi being called anywhere in the WMCore code.
I also scanned all the DBSGlobalReader-* logs for one of the cmsweb backends and there was no match.

I think it's safe to deprecate/disable this API.

@yuyiguo
Member Author

yuyiguo commented Jul 1, 2019 via email

@amaltaro
Contributor

Hi Yuyi,

let me get back to this issue. As we have recently noticed, we do rely heavily on this listFileParentsByLumi API, which is used to update the parentage information for output datasets of StepChain workflows.

Indeed it times out for quite a few blocks that we're trying to process; see this WMCore issue:
dmwm/WMCore#9537

While trying to provide a slice of LFNs (in order to work around the frontend timeout), I found the following server bug (reported as a client bug):

```python
>>> dbsApi = DbsApi(url = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/')
>>> block
'/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/RunIISummer19UL18HLT-102X_upgrade2018_realistic_v15-v2/GEN-SIM-RAW#1bb50416-6f34-40ce-b8e4-fb301d3853f3'
>>> listFiles[:5]
['/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/2E5C12EF-5BEA-6A4C-83E1-6CE10B9960CC.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/6D055153-A777-4C46-88EB-70B6873BDC72.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/9570805C-4EB2-D74C-BD25-81FB342FA1C6.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/0BC212B6-8A2C-2F4E-980F-429BD1AC56FE.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/5096D189-A9B5-9B4C-9D7F-158B655C9B5F.root']
>>>
>>> res = dbsApi.listFileParentsByLumi(block_name=block, logical_file_name=listFiles[:5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 484, in listFileParentsByLumi
    return self.__callServer("fileparentsbylumi", data=kwargs, callmethod='POST')
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 201, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 228, in __parseForException
    raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 500: (DatabaseError) ORA-00928: missing SELECT keyword
```

Do we have another option to try to make it work? The block above has 231 files, and that's already too much to fit within 5 minutes.
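For reference, the LFN-slicing workaround mentioned above amounts to calling the API on small chunks of the file list. A minimal sketch, assuming the `dbsApi`, `block`, and `listFiles` objects from the session above; the chunk size of 5 is an arbitrary choice, not a recommended value:

```python
def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage against an already-constructed DbsApi instance:
# for lfn_slice in chunks(listFiles, 5):
#     res = dbsApi.listFileParentsByLumi(block_name=block,
#                                        logical_file_name=lfn_slice)
```

As the thread goes on to show, this slicing alone does not solve the problem, because the server-side query still scans the whole parent dataset.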

@amaltaro
Contributor

The problem is likely somewhere in this DAO:
https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/dao/Oracle/FileParent/ListFileParentageByLumi.py

but I'm having a hard time reading it, and I can't access the DBS Oracle account, so I can't even play with it.

@yuyiguo
Member Author

yuyiguo commented Feb 10, 2020 via email

@yuyiguo
Member Author

yuyiguo commented Feb 10, 2020

Alan,

There is a lot to cover here. In short, I have no solution that makes this work within 5 minutes. I tried the example block and file list from this ticket with bare SQL statements, and no matter what I did, the query could not finish in 5 minutes; our datasets are simply too big now. Using an LFN list does not reduce the search time. I think the reason is that the LFNs only limit the children side, while the parent side still covers the entire parent dataset. The children side is already much smaller than the parent side because it is limited to a single block.

Details:

The original SQL:

```sql
WITH
TOKEN_GENERATOR AS (
    SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token
    FROM DUAL
    CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
),
parents AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS pid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         WHERE f.dataset_id IN
             (SELECT parent_dataset_id FROM cms_dbs3_prod_global_owner.dataset_parents dp
              INNER JOIN cms_dbs3_prod_global_owner.datasets d ON d.dataset_id = dp.this_dataset_id
              WHERE d.dataset = :child_ds_name))),
children AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS cid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         INNER JOIN cms_dbs3_prod_global_owner.blocks b ON f.block_id = b.block_id
         WHERE b.block_name = :child_block_name
           AND f.logical_file_name IN (SELECT token FROM TOKEN_GENERATOR)))
SELECT DISTINCT cid, pid
FROM children c INNER JOIN parents p ON c.R = p.R AND c.L = p.L
```

There was a bug in the third subquery, "children": as written, Oracle does not accept one subquery using the other subqueries, which is why we got the error "(DatabaseError) ORA-00928: missing SELECT keyword".

So what I did was simply remove the file-list input and search the entire block. See the query below:

```sql
WITH
TOKEN_GENERATOR AS (
    SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token
    FROM DUAL
    CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
),
parents AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS pid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         WHERE f.dataset_id IN
             (SELECT parent_dataset_id FROM cms_dbs3_prod_global_owner.dataset_parents dp
              INNER JOIN cms_dbs3_prod_global_owner.datasets d ON d.dataset_id = dp.this_dataset_id
              WHERE d.dataset = :child_ds_name))),
children AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS cid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         INNER JOIN cms_dbs3_prod_global_owner.blocks b ON f.block_id = b.block_id
         WHERE b.block_name = :child_block_name))
SELECT DISTINCT cid, pid
FROM children c INNER JOIN parents p ON c.R = p.R AND c.L = p.L;
```

The query above took 5 minutes 18 seconds.

Then I added the file list back to the query as below; instead of using a subquery defined in the WITH clause, I inlined the token-generator query directly:

```sql
WITH
parents AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS pid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         WHERE f.dataset_id IN
             (SELECT parent_dataset_id FROM cms_dbs3_prod_global_owner.dataset_parents dp
              INNER JOIN cms_dbs3_prod_global_owner.datasets d ON d.dataset_id = dp.this_dataset_id
              WHERE d.dataset = :child_ds_name))),
children AS (
    SELECT run_num AS R, lumi_section_num AS L, file_id AS cid
    FROM cms_dbs3_prod_global_owner.file_lumis fl
    WHERE fl.file_id IN
        (SELECT file_id FROM cms_dbs3_prod_global_owner.files f
         INNER JOIN cms_dbs3_prod_global_owner.blocks b ON f.block_id = b.block_id
         WHERE b.block_name = :child_block_name
           AND f.logical_file_name IN
               (SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token
                FROM DUAL
                CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1)))
SELECT DISTINCT cid, pid
FROM children c INNER JOIN parents p ON c.R = p.R AND c.L = p.L;
```

This query took 6 minutes 43 seconds. As we can see, limiting by LFNs did not help.

@amaltaro
Contributor

I see! Thank you for looking into it, Yuyi.
We will then have to redesign the whole StepChain parentage handling, in addition to finding a possible workaround for the upcoming days/weeks (like building the parentage on the client side and posting the parentage relationship directly to an API). Is there such a DBS write API?

@yuyiguo
Member Author

yuyiguo commented Feb 10, 2020

Alan,
I am not sure I understand your question.
Are you asking how we should handle the missing parentage data already in DBS?
I am still thinking about how to fix that.

@yuyiguo
Member Author

yuyiguo commented Feb 11, 2020

Alan,

In order to handle the data already in DBS, I am thinking that we may break the listFileParentsByLumi(block_name) API into two APIs:

  1. listParentsFileLumis(child_ds_name): the first part of the compound query. This finds the lumi numbers and file IDs of the parent dataset.

  2. listFileLumis(child_block_name): the second part of the compound query. This finds the lumi numbers and file IDs of the child block.

Once we have both result sets, we can find the unique matches between 1 and 2 in Python.

If you think this is an approach you want to pursue, I will test whether we can get all the lumi numbers for a big dataset within 5 minutes.

Yuyi
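The two-API proposal above is effectively a client-side hash join on (run, lumi) pairs. A minimal sketch, assuming each proposed API returns (run, lumi, file_id) tuples; the function name `match_parentage` and the row layout are illustrative assumptions, not existing DBS code:

```python
from collections import defaultdict

def match_parentage(parent_rows, child_rows):
    """Join parent and child (run, lumi, file_id) rows on (run, lumi),
    returning the distinct (child_id, parent_id) pairs."""
    # Index the parent side by (run, lumi) for O(1) lookups.
    parents_by_lumi = defaultdict(set)
    for run, lumi, pid in parent_rows:
        parents_by_lumi[(run, lumi)].add(pid)
    # Probe with the child side and collect distinct pairs.
    pairs = set()
    for run, lumi, cid in child_rows:
        for pid in parents_by_lumi.get((run, lumi), ()):
            pairs.add((cid, pid))
    return pairs
```

This mirrors the final `select distinct cid, pid ... inner join` of the SQL earlier in the thread, only done in application memory, so it would have to run block by block to keep the memory footprint bounded.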

@vkuznet
Contributor

vkuznet commented Feb 11, 2020

I really don't know if the proposed solution will be sustainable, since it seems to me that sooner or later you'll again hit the 5-minute limit. I consider this a temporary fix that does not solve the underlying problem. If the DB can't handle the parentage load the way DMWM queries it, we can't fix that in the APIs; we either need to do something at the DB level to speed up those queries (e.g. run a stored procedure to generate this info in the background), refactor the DB to handle the parentage use case, or move to another solution (HDFS) to get parentage.

But before that I would rather see a full description of WHY we need to support this use case. From the ticket description it is unclear why DMWM needs this, apart from the fact that it relies heavily on it.

@amaltaro
Contributor

Yuyi,

I'm not sure I followed your suggestion. Are you saying that we could:

  • given a parent dataset, list:
    • the file IDs
    • the run/lumi list for each of those files
  • then, given the child dataset (or block), get:
    • their file IDs
    • the run/lumi list for each of those files

and fix the parentage relationship on the application side, injecting a list of parent file IDs for each child file ID? If this is what you're saying, then we should likely use block-level operations to avoid eating all the memory while parsing it.

Valentin, the problem has been reported here:
dmwm/WMCore#8269

and this is how we decided to handle StepChain parentage; things happen asynchronously, meaning that we could insert data into DBS for a NANOAODSIM dataset while the AODSIM hasn't even been merged yet.

I don't discard the possibility of fixing it within WMAgent, but I'm pretty sure that would be a substantial change and it can't be done within a few weeks.

@yuyiguo
Member Author

yuyiguo commented Feb 11, 2020

Valentin, Alan,

What I proposed was a temporary solution to fix the existing data already in DBS.
The questions are:

  1. When will the problem be fixed in WMAgent, so that we no longer have to fix the parentage?
  2. How many blocks are missing file-level parentage in DBS?

When I proposed that solution, I assumed that:

  1. WMAgent would be fixed in the next few months. If this is not the case, what is the schedule?
  2. The number of blocks with missing file parentage is very small; that is why no one noticed it for many years.

Maybe we should discuss the problem more before offering any fixes.
Thanks,
Yuyi

@amaltaro
Contributor

Yuyi, it actually isn't/wasn't in our plans to modify how the StepChain parentage works, because I wasn't aware of such limitations.
I can try to find out the number of blocks without parentage information, but it's likely in the hundreds (hopefully not a few thousand).

I think the question back when this issue was first being discussed was: how much work would it take to get this problem fixed in WMCore and in DBS? That is still a valid question.
As I said before, we can certainly fix it in WMAgent; it likely won't be a clean solution, and it will likely take weeks of development, but it's not impossible.

@yuyiguo
Member Author

yuyiguo commented Feb 11, 2020

Alan,
If I understood you correctly, WMAgent/StepChain will always send broken/partial data to DBS? In other words, we are looking for solutions to fix the data in DBS instead of fixing the source of the problem.

WMAgents have their own databases. Can the partial data wait in the local DB? WMAgent could insert the NANOAODSIM dataset into its local database while waiting for the AODSIM to be merged, then upload the complete dataset/block into DBS.

@amaltaro
Contributor

Your understanding is correct, Yuyi. The merging step is asynchronous, and so is data injection into PhEDEx/DBS. So yes, StepChain output will likely always be broken.

Yes, we can definitely implement what we need in the agents. However, I'd first like to explore a DBS/database-side option, if we can still find one.

FYI @todor-ivanov

@yuyiguo
Member Author

yuyiguo commented Feb 11, 2020

Alan,
How do you fix the data in DBS now? Do you have a cron job running at some frequency to fix it?

We are going to partition the DBS files and file_lumis tables. I think queries against these two tables will improve after the partitioning, but the partitioning will take some time. How long can you wait? Kate is out of the office this week; I will discuss the partitioning schedule with her next week.

@amaltaro
Contributor

We have a CherryPy thread running every 3 hours:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

even though it looks like the current cycle has been running since the cmsweb production upgrade (it is implemented sequentially). Do you think it would hurt DBS too badly if we fixed the parentage with X concurrent requests against DBS (say, 10 blocks concurrently)?

I believe it would be okay to wait a few weeks, but we might be unlucky and have users looking into those samples and their parents, which would raise the priority of getting it fixed.

I'm also about to leave on vacation (the day after tomorrow).
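The concurrent-fix idea above could be throttled with a bounded thread pool. A sketch, assuming a hypothetical per-block fix routine `fix_block_parentage`; the limit of 10 is simply the concurrency mentioned above, not a tested value:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fix_blocks_concurrently(blocks, fix_block_parentage, max_workers=10):
    """Run the per-block parentage fix with at most max_workers
    concurrent requests against DBS, collecting per-block results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fix_block_parentage, b): b for b in blocks}
        for fut in as_completed(futures):
            block = futures[fut]
            try:
                results[block] = fut.result()
            except Exception as exc:
                # Record the failure and keep processing remaining blocks.
                results[block] = exc
    return results
```

Whether 10 concurrent requests is safe for the database is exactly the open question in the thread; the pool size is the knob to tune down if DBS suffers.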

@vkuznet
Contributor

vkuznet commented Feb 12, 2020 via email

@amaltaro
Contributor

I had the impression that select/join statements wouldn't lock tables; we are also not updating anything within the same transaction, so it might not hurt database performance. Yuyi, can you please confirm that?

@vkuznet
Contributor

vkuznet commented Feb 12, 2020 via email
