Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to avoid warning messages when tardy TPs arrive at the TPStreamWriter #358

Merged
merged 7 commits into from
Jul 6, 2024

Conversation

bieryAtFnal
Copy link
Collaborator

@bieryAtFnal bieryAtFnal commented Jul 2, 2024

This PR depends on ones in daqdataformats and hdf5libs.

These changes include a flag to control whether warning messages are produced when tardy TPs arrive at the TPStreamWriter. They also include additional metrics to provide insight into lost TPs when the warning messages are disabled.

Kurt Biery added 2 commits July 2, 2024 09:56
…ce to support our desire to not complain too loudly when late-arriving TPs can't be added to existing TimeSlices in the TPStreamWriter.
…mWriter and to better track such a condition with additional metrics.
@bieryAtFnal
Copy link
Collaborator Author

bieryAtFnal commented Jul 2, 2024

Questions to be discussed in meetings on Wednesday...

  1. Do we want to keep track of TPSets received vs. ones that actually have TPs in them? It seems that there are a fair number of heartbeat TPSets that don't contain any TPs to be stored on disk.
  2. Should we provide a top-level configuration parameter for controlling whether tardy TPs arriving at the TPStreamWriter produce a warning message? (lower-level param is warn_user_when_late_tps_are_discarded, and its default value is currently false)
  3. Is the content of the warning message OK? probably we'll want to drop the run number from the message...
  4. Are the name choices for the new metrics OK? e.g. "discarded" vs. "suppressed". And, "accumulated" vs. something else.
  5. Do we want to try to add more metrics to help diagnose upstream TP problems?

@bieryAtFnal
Copy link
Collaborator Author

bieryAtFnal commented Jul 2, 2024

Results from a 30-second emulated-data test run with 2 ReadoutApps, each with 3 links. In this run, the TPs from the two RUs were close in time (so no warnings should have been produced). The flag for allowing the warning messages was set to false.

(dbt) [biery@daq rundir]$ grep -A 55 tpswriter opmon_collated.json 
        "tpswriter": {
            "bytes_output": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 3789224,
                "1719939811": 6840560,
                "1719939821": 6840560,
                "1719939831": 4770584,
                "1719939841": 0
            },
            "timeslice_written": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 6,
                "1719939811": 10,
                "1719939821": 10,
                "1719939831": 7,
                "1719939841": 0
            },
            "tp_accumulated": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 89681,
                "1719939811": 122133,
                "1719939821": 122158,
                "1719939831": 62914,
                "1719939841": 0
            },
            "tp_written": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 67615,
                "1719939811": 122070,
                "1719939821": 122070,
                "1719939831": 85131,
                "1719939841": 0
            },
            "tpset_accumulated": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 43937,
                "1719939811": 60033,
                "1719939821": 60478,
                "1719939831": 31362,
                "1719939841": 0
            },
            "tpset_received": {
                "1719939781": 0,
                "1719939791": 0,
                "1719939801": 55871,
                "1719939811": 76097,
                "1719939821": 77526,
                "1719939831": 40141,
                "1719939841": 0
            }
        }
(dbt) [biery@daq rundir]$ egrep 'ERROR|WARN' log_*.txt* 
log_rulocalhosteth0_5336.txt:2024-Jul-02 12:03:14,490 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!
log_rulocalhosteth1_5337.txt:2024-Jul-02 12:03:14,498 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!

@bieryAtFnal
Copy link
Collaborator Author

Results from a 30-second emulated-data test run with 2 ReadoutApps, each with 3 links. In this run, the TPs from the two RUs were close in time (so no warnings should have been produced). The flag for allowing the warning messages was set to true. (but, no warning messages were produced, as expected)

(dbt) [biery@daq rundir]$ grep -A 55 tpswriter opmon_collated.json 
        "tpswriter": {
            "bytes_output": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 4207952,
                "1719940183": 6840560,
                "1719940193": 6840560,
                "1719940203": 4353048,
                "1719940213": 0
            },
            "timeslice_written": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 7,
                "1719940183": 10,
                "1719940193": 10,
                "1719940203": 7,
                "1719940213": 0
            },
            "tp_accumulated": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 90842,
                "1719940183": 122127,
                "1719940193": 122154,
                "1719940203": 61776,
                "1719940213": 0
            },
            "tp_written": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 75084,
                "1719940183": 122070,
                "1719940193": 122070,
                "1719940203": 77675,
                "1719940213": 0
            },
            "tpset_accumulated": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 44569,
                "1719940183": 60050,
                "1719940193": 60442,
                "1719940203": 30607,
                "1719940213": 0
            },
            "tpset_received": {
                "1719940153": 0,
                "1719940163": 0,
                "1719940173": 56587,
                "1719940183": 76205,
                "1719940193": 77402,
                "1719940203": 39146,
                "1719940213": 0
            }
        }
(dbt) [biery@daq rundir]$ egrep 'ERROR|WARN' log_*.txt* 
log_rulocalhosteth0_5336.txt:2024-Jul-02 12:09:25,878 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!
log_rulocalhosteth1_5337.txt:2024-Jul-02 12:09:25,885 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!

@bieryAtFnal
Copy link
Collaborator Author

Results from a 30-second emulated-data test run with 2 ReadoutApps, each with 3 links. In this run, the TPs from the two RUs had a significant amount of time between them (~2.5 seconds), so TPs should have been dropped. The flag for allowing the warning messages was set to false, so no warning message should have been produced, but the loss of TPs should be visible in the metrics.

(dbt) [biery@daq rundir]$ grep -A 61 tpswriter opmon_collated.json 
        "tpswriter": {
            "bytes_output": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 2654560,
                "1719940875": 3420440,
                "1719940885": 3420440,
                "1719940895": 3420440,
                "1719940905": 1678608,
                "1719940915": 0
            },
            "timeslice_written": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 9,
                "1719940875": 10,
                "1719940885": 10,
                "1719940895": 10,
                "1719940905": 5,
                "1719940915": 0
            },
            "tp_accumulated": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 88760,
                "1719940875": 122141,
                "1719940885": 122146,
                "1719940895": 122150,
                "1719940905": 39651,
                "1719940915": 0
            },
            "tp_written": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 47363,
                "1719940875": 61035,
                "1719940885": 61035,
                "1719940895": 61035,
                "1719940905": 29953,
                "1719940915": 0
            },
            "tpset_accumulated": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 43324,
                "1719940875": 59807,
                "1719940885": 60509,
                "1719940895": 60556,
                "1719940905": 19632,
                "1719940915": 0
            },
            "tpset_received": {
                "1719940845": 0,
                "1719940855": 0,
                "1719940865": 55077,
                "1719940875": 75954,
                "1719940885": 77688,
                "1719940895": 77631,
                "1719940905": 24954,
                "1719940915": 0
            }
        }
(dbt) [biery@daq rundir]$ egrep 'ERROR|WARN' log_*.txt* 
log_rulocalhosteth0_5336.txt:2024-Jul-02 12:20:58,296 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!
log_rulocalhosteth1_5337.txt:2024-Jul-02 12:20:58,301 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!

To avoid warning messages about TPs not being available in the RU latency buffers, the data request timeout was increased in this run, and extra time was allowed at the start and end of the run without triggers.

"daq_common": {
    "data_rate_slowdown_factor": 1,
    "data_request_timeout_ms": 5000
  },
nanorc --partition-number 4 mdapp_tpg ${USER}-test boot conf wait 20 start_run --wait 5 142 wait 30 disable_triggers wait 5 stop_run wait 20 scrap terminate```

@bieryAtFnal
Copy link
Collaborator Author

Results from a 30-second emulated-data test run with 2 ReadoutApps, each with 3 links. In this run, the TPs from the two RUs had a significant amount of time between them (~2.5 seconds), so TPs should have been dropped. The flag for allowing the warning messages was set to true, so warning message should have been produced, and the loss of TPs should be visible in the metrics.

(dbt) [biery@daq rundir]$ grep -A 61 tpswriter opmon_collated.json 
        "tpswriter": {
            "bytes_output": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 2601808,
                "1719941206": 3420440,
                "1719941216": 3420440,
                "1719941226": 3420440,
                "1719941236": 1708120,
                "1719941246": 0
            },
            "timeslice_written": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 9,
                "1719941206": 10,
                "1719941216": 10,
                "1719941226": 10,
                "1719941236": 5,
                "1719941246": 0
            },
            "tp_accumulated": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 80714,
                "1719941206": 122139,
                "1719941216": 122138,
                "1719941226": 122148,
                "1719941236": 47836,
                "1719941246": 0
            },
            "tp_written": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 46421,
                "1719941206": 61035,
                "1719941216": 61035,
                "1719941226": 61035,
                "1719941236": 30480,
                "1719941246": 0
            },
            "tpset_accumulated": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 39555,
                "1719941206": 59970,
                "1719941216": 60654,
                "1719941226": 60522,
                "1719941236": 23905,
                "1719941246": 0
            },
            "tpset_received": {
                "1719941175": 0,
                "1719941186": 0,
                "1719941196": 50320,
                "1719941206": 76096,
                "1719941216": 77626,
                "1719941226": 77515,
                "1719941236": 30514,
                "1719941246": 0
            }
        }
(dbt) [biery@daq rundir]$ egrep 'ERROR|WARN' log_*.txt* 
log_rulocalhosteth0_5336.txt:2024-Jul-02 12:26:29,375 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!
log_rulocalhosteth1_5337.txt:2024-Jul-02 12:26:29,378 LOG [void dunedaq::fdreadoutlibs::WIBEthFrameProcessor::sequence_check(frameptr) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/fdreadoutlibs/src/wibeth/WIBEthFrameProcessor.cpp:365] *** Data Integrity ERROR *** Sequence ID continuity is completely broken! Something is wrong with the FE source or with the configuration!
log_tpwriter_5339.txt:2024-Jul-02 12:26:33,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 4 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:34,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 5 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:35,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 6 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:36,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 7 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:37,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 8 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:38,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 9 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:39,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 10 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:40,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 11 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:41,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 12 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:42,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 13 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:43,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 14 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:44,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 15 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:45,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 16 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:46,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 17 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:47,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 18 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:48,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 19 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:49,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 20 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:50,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 21 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:51,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 22 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:52,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 23 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:53,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 24 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:54,449 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 25 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:55,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 26 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:56,447 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 27 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:57,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 28 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:58,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 29 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:26:59,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 30 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:27:00,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 31 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:27:01,448 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 32 in run 151 DAQModule: tpswriter
log_tpwriter_5339.txt:2024-Jul-02 12:27:02,447 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/02JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:240] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 33 in run 151 DAQModule: tpswriter

@wesketchum
Copy link
Contributor

Questions to be discussed in meetings on Wednesday...

1. Do we want to keep track of TPSets received vs. ones that actually have TPs in them?  It seems that there are a fair number of _heartbeat_ TPSets that don't contain any TPs to be stored on disk.

2. Should we provide a top-level configuration parameter for controlling whether tardy TPs arriving at the TPStreamWriter produce a warning message?  (lower-level param is `warn_user_when_late_tps_are_discarded`, and its default value is currently `false`)

3. Is the content of the warning message OK?  probably we'll want to drop the run number from the message...

4. Are the name choices for the new metrics OK?  e.g. "discarded" vs. "suppressed".  And, "accumulated" vs. something else.

5. Do we want to try to add more metrics to help diagnose upstream TP problems?
  1. I think that's reasonable, yes, but not distinguishing these at first order is ok.
  2. Yes, making that confiugrable seems very reasonable.
  3. Yes, it looks ok. Knowing how tardy they were may be additionally useful. Agree we should remove the run number.
  4. I like 'discarded'. 'accumulated' is per interval or over the run? If it's per interval, I may prefer 'received' or 'collected' or something, as accumulated to me has a feel of continually accumulating over a long period (like in a run). But I don't have a strong feeling on that.
  5. the only other thing I can immediately think of (which maybe is done) would be to track that the start/end windows of TPSets line up and know if we are likely missing one, receiving out of order, etc. per source ID. This probably requires more thinking to handle properly though, so is maybe more a future consideration.

@bieryAtFnal
Copy link
Collaborator Author

Here are samples of the latest metrics (shown in collated form from the disk-output opmon mode):

        "tpswriter": {
            "bytes_output": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 3645752,
                "1720206418": 6840560,
                "1720206428": 6840560,
                "1720206438": 6840560,
                "1720206448": 3647320,
                "1720206458": 0
            },
            "heartbeat_tpsets_received": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 11363,
                "1720206418": 16063,
                "1720206428": 17432,
                "1720206438": 17386,
                "1720206448": 6241,
                "1720206458": 0
            },
            "tardy_timeslice_max_seconds": {
                "1720206388": 0.0,
                "1720206398": 0.0,
                "1720206408": 0.0,
                "1720206418": 0.0,
                "1720206428": 0.0,
                "1720206438": 0.0,
                "1720206448": 0.0,
                "1720206458": 0.0
            },
            "timeslices_written": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 6,
                "1720206418": 10,
                "1720206428": 10,
                "1720206438": 10,
                "1720206448": 6,
                "1720206458": 0
            },
            "total_tps_received": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 85923,
                "1720206418": 208076,
                "1720206428": 330219,
                "1720206438": 452374,
                "1720206448": 496344,
                "1720206458": 496344
            },
            "total_tps_written": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 65053,
                "1720206418": 187123,
                "1720206428": 309193,
                "1720206438": 431263,
                "1720206448": 496344,
                "1720206458": 496344
            },
            "tps_received": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 85923,
                "1720206418": 122153,
                "1720206428": 122143,
                "1720206438": 122155,
                "1720206448": 43970,
                "1720206458": 0
            },
            "tps_written": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 65053,
                "1720206418": 122070,
                "1720206428": 122070,
                "1720206438": 122070,
                "1720206448": 65081,
                "1720206458": 0
            },
            "tpsets_with_tps_received": {
                "1720206388": 0,
                "1720206398": 0,
                "1720206408": 41924,
                "1720206418": 59855,
                "1720206428": 60715,
                "1720206438": 60803,
                "1720206448": 22020,
                "1720206458": 0
            }
        }

and here is a sample warning message:

log_tpwriter_5839.txt:2024-Jul-05 14:03:17,450 WARNING [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /home/nfs/biery/dunedaq/05JulFDv4.4.4Testing/sourcecode/dfmodules/plugins/TPStreamWriter.cpp:260] Tardy TPs from SourceIDs [Trigger_0x00000003,Trigger_0x00000004,Trigger_0x00000005] were discarded from TimeSlice number 9 (~2 sec too late) DAQModule: tpswriter

@wesketchum wesketchum merged commit bfbdf46 into patch/fddaq-v4.4.x Jul 6, 2024
@wesketchum wesketchum deleted the kbiery/tpsw_ignore_late_tps branch July 6, 2024 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants