SCC > SRT error, Domesday LD Capture: AttributeError: 'NoneType' object has no attribute 'append_text' #394

Open
rktcc opened this issue Aug 15, 2023 · 7 comments

@rktcc

rktcc commented Aug 15, 2023

ttconv 1.0.7 (pip install --pre ttconv)
python 3.11

head.scc.txt (rename from .txt to .scc, since GitHub didn't accept the .scc extension)

This is an SCC file extracted from a LaserDisc film captured using a Domesday Duplicator. Additionally, this is a Japanese language film.

This issue has occurred in the past with other Domesday captures. I worked around it by using https://github.com/atsampson/ttconv, but that has now stopped working, and I can't work out what changes were made before the updates were merged into 1.0.7.

Unsupported SCC word: 0x7c                                                  
Unsupported SCC word: 0x7c                                                  
Unsupported SCC word: 0x107c                                                
Reading: |███████-------------------------------------------|  15% Complete
Traceback (most recent call last):
  File "/home/pip/.local/venv/ttconv/bin/tt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/tt.py", line 439, in main
    args.func(args)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/tt.py", line 320, in convert
    model = scc_reader.to_model(file_as_str, reader_config, progress_callback_read)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 621, in to_model
    context.process_line(scc_line)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 556, in process_line
    self.process_text(word, line.time_code)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 460, in process_text
    self.buffered_caption.append_text(word)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'append_text'
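
The last frame shows `self.buffered_caption` was still `None` when text arrived. As a rough illustration of that failure mode, here is a minimal sketch using hypothetical stand-in classes (`Caption`, `Context`, and both `process_text_*` methods are invented for this example and are not ttconv's actual code). It assumes the buffer is only created by a caption-starting control code, which may or may not match how the reader really works.

```python
from typing import List, Optional


class Caption:
    """Hypothetical stand-in for a caption buffer (not ttconv's class)."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def append_text(self, text: str) -> None:
        self.chunks.append(text)


class Context:
    """Hypothetical reader state: no buffer exists until a caption is started."""

    def __init__(self) -> None:
        self.buffered_caption: Optional[Caption] = None
        self.dropped: List[str] = []

    def start_caption(self) -> None:
        # Assumption: only a caption-starting control code creates the buffer.
        self.buffered_caption = Caption()

    def process_text_unguarded(self, word: str) -> None:
        # Same pattern as the failing line: assumes the buffer already exists.
        self.buffered_caption.append_text(word)

    def process_text_guarded(self, word: str) -> None:
        # Defensive variant: record and skip text that has no buffer to land in.
        if self.buffered_caption is None:
            self.dropped.append(word)
            return
        self.buffered_caption.append_text(word)
```

With the guarded variant, stray text that precedes any caption start is collected in `dropped` instead of raising, which would at least let a conversion run to the end of a noisy file.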

I wonder whether the capture has errors or is otherwise flawed, and that is causing the "unsupported characters", or whether it's simply that the Japanese character set is not supported.

thank you

@palemieux
Contributor

@valnoel Can you look at this issue in the context of your work improving the SCC reader?

@rktcc
Author

rktcc commented Aug 15, 2023

It was brought to my attention that the Japanese character sets are not present in the SCC codes files, so I imagine this might be difficult to achieve.

It was also noted that the content in the example is "two byte unicode"; I'm not sure whether that's helpful. I'm just passing on some information from the Domesday group conversation.

Thanks to the maintainers for assistance!

@valnoel
Collaborator

valnoel commented Aug 16, 2023

The SCC reader does not currently support Japanese characters, which do not appear in the CEA-608 specification.
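
To illustrate why there is no room for Japanese here, below is a minimal sketch of decoding the basic CEA-608 character set: a single 7-bit, mostly ASCII table with a handful of Latin substitutions. The function name `decode_basic_char` is hypothetical, the sketch is not taken from ttconv, and it covers only the basic set (not the special or extended Western European sets).

```python
from typing import Optional

# Basic CEA-608 codes whose character differs from ASCII (7-bit code -> character).
CEA608_BASIC_OVERRIDES = {
    0x2A: "á", 0x5C: "é", 0x5E: "í", 0x5F: "ó", 0x60: "ú",
    0x7B: "ç", 0x7C: "÷", 0x7D: "Ñ", 0x7E: "ñ", 0x7F: "█",
}


def decode_basic_char(byte: int) -> Optional[str]:
    """Map one caption byte to a basic-set character, or None if not printable."""
    code = byte & 0x7F  # drop the parity bit
    if code < 0x20:
        return None  # null padding or the control-code range, not a basic character
    return CEA608_BASIC_OVERRIDES.get(code, chr(code))
```

For example, `decode_basic_char(0xC1)` strips the parity bit and yields `"A"`. Nothing in this table can represent kana or kanji.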

It seems an extension was once submitted to the specification, but I don't have any more information about it...

Otherwise, it seems CEA-708 introduces support for Unicode characters, which allows the display of Japanese and other languages.

@palemieux What do you think?

@palemieux
Contributor

> Otherwise, it seems CEA-708 introduces support for Unicode characters, which allows the display of Japanese and other languages.

Ok will look at this next week.

@palemieux palemieux self-assigned this Aug 16, 2023
@palemieux palemieux added the enhancement New feature or request label Aug 16, 2023
@palemieux
Contributor

@rktcc Can you provide a link to the forum discussion thread? I could not find any specification for carrying arbitrary Unicode characters in SCC.

@rktcc
Author

rktcc commented Sep 12, 2023

> @rktcc Can you provide a link to the forum discussion thread? I could not find any specification for carrying arbitrary Unicode characters in SCC.

Hi, I am sorry for the delay.

Here is the discussion on ttconv missing Japanese character sets:

https://discord.com/channels/665557267189334046/676084498097766451/1140876443719577650

I think it's not the encoding and decoding that's wrong, there needs to be EIA-608 support added to ttconv and a way to detect EIA-608
https://github.com/sandflow/ttconv/tree/master/src/main/python/ttconv/scc/codes there's no Japanese character support at all
https://en.m.wikipedia.org/wiki/EIA-608

Here is a thought that the Norpak Non-Western addition may be what's needed...

https://discord.com/channels/665557267189334046/676084498097766451/1141486766579265576

Wikipedia says that there's non-western character support from Norpak https://en.m.wikipedia.org/wiki/EIA-608 under Non-Western Norpak Character Sets

Someone mentions CEA-608 section 6.4, Table 4, as a reference for Asian languages; however, only PRC and (South) Korea are mentioned.

https://discord.com/channels/665557267189334046/676084498097766451/1141484827619635240

Referencing 6.4 Character Sets (Normative), 6.4.1 Standard, CEA-608
https://media.discordapp.net/attachments/676084498097766451/1141486499452432464/image.png

There's also a thought that it could be CC/Teletext. However, since other subtitle content has been extracted from LaserDiscs using the Domesday Duplicator and converted from SCC to plain-text SRT, I would guess that the Japanese SCC data is in the same format and that only the character sets are missing from ttconv.

Did Japan use CC? ISTR that they had a teletext-like system for magazine-type data - which may also have worked for subtitles/closed-captions? (I know the wikipedia article mentions that two-byte stuff was added to the spec, but could that be like 50Hz being added to ATSC 1.0 - in an attempt to capture markets that didn't happen?) https://en.wikipedia.org/wiki/JTES was the teletext system (CCIR System D?)

I hope this is helpful in some capacity in either closing the ticket due to lack of project support, or adding some kind of additional processing.

If more info is needed, I can look further. The Discord is free to join; sadly, the discussion is not hosted on an actual forum. Alternatively, the general chat can be joined over IRC, on channel #domesday86 on the https://libera.chat IRC network. You would not need to sign up for Discord in that case, as a bot relays messages each way.

Discord Invite: https://github.com/happycube/ld-decode#documentation

Thank you again

@palemieux
Contributor

I have joined the discord server.

In the meantime, I have spent some quality time staring at the sample file, and it does not look like CEA-608 at all, e.g.:

[image]

Is that noise/errors from the laserdisc capture? Could it be something totally different like bitmaps?
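
One rough way to probe the noise question: CEA-608 bytes carry odd parity in the high bit, and SCC files normally store each byte with that parity bit intact, so a high failure rate would hint at capture errors or at data that is not 608 at all. The sketch below assumes the usual SCC text layout (a header line, then lines of a time code followed by four-hex-digit words); `parity_report` is a hypothetical helper, not part of ttconv.

```python
import re
from typing import Iterable, Tuple

WORD_RE = re.compile(r"^[0-9a-fA-F]{4}$")


def has_odd_parity(byte: int) -> bool:
    """True if the 8-bit value has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") % 2 == 1


def parity_report(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (bytes_checked, bytes_failing_odd_parity) for SCC text lines."""
    checked = failed = 0
    for line in lines:
        parts = line.split()
        # Skip the header and blank lines; data lines start with a time code.
        if len(parts) < 2 or ":" not in parts[0]:
            continue
        for word in parts[1:]:
            if not WORD_RE.match(word):
                continue
            value = int(word, 16)
            for byte in (value >> 8, value & 0xFF):
                checked += 1
                if not has_odd_parity(byte):
                    failed += 1
    return checked, failed
```

Usage would be along the lines of `parity_report(open("head.scc", encoding="ascii", errors="replace"))`. A clean 608 stream should report close to zero failures.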
