Support searching videos [subtitles, captions, transcripts?] #152

Open
wants to merge 10 commits into
base: master
from

Conversation

deldesir
Collaborator

@deldesir deldesir commented Mar 28, 2024

🚀 Pull Request Overview:
With this PR, searching subtitles is now possible. The video titles returned as results are used to enhance Calibre-Web's simple search.

📋 Checklist:

root@box:/usr/local/calibre-web-py3/cps# lb-wrapper search feelings
/usr/local/bin/lb
2024-03-28 11:03:52 - [Info] Running xklb command: lb search /library/calibre-web/xklb-metadata.db feelings
3 captions
Why you procrastinate even when it feels bad - /library/calibre-web/TED-Ed/Why you procrastinate even when it feels bad (62)/Why you procrastinate even when it feels b - TED-Ed.mp4
    2:11 But we’re most likely to procrastinate tasks that evoke negative feelings,
    2:55 Because procrastination is motivated by our negative feelings,
    3:58 ongoing feelings of shame,

image

📌 Testing scenario:

  • Download a single video or playlist
  • Use the search field to search for some word or phrase present in the subtitles. You might need to watch a video to get some terms ;-)
  • Alternatively, you can get an idea of how this works in the background by running lb-wrapper search <your_term(s)> in a terminal.
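
For anyone reading the CLI output shown in the checklist above, here is a minimal sketch of how that text could be grouped into video titles and caption hits. It assumes the layout seen in the sample (an unindented "title - /path" line followed by indented "timestamp caption" lines). The function name parse_lb_search is hypothetical and is not taken from this PR's code.

```python
import re
from typing import Dict, List, Tuple

# Matches an indented caption line such as "    2:11 But we're ..."
CAPTION_RE = re.compile(r"^\s+(\d+(?::\d{2})+)\s+(.*)$")


def parse_lb_search(output: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group caption hits by video title.

    Assumes the layout of the sample output: an unindented
    "<title> - <absolute path>" line followed by indented
    "<timestamp> <caption text>" lines.
    """
    results: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for line in output.splitlines():
        match = CAPTION_RE.match(line)
        if match and current is not None:
            results[current].append((match.group(1), match.group(2)))
        elif " - /" in line:
            # Title and absolute path are separated by " - /" in the sample.
            current = line.split(" - /", 1)[0].strip()
            results.setdefault(current, [])
    return results


sample = """3 captions
Why you procrastinate even when it feels bad - /library/calibre-web/TED-Ed/Why you procrastinate even when it feels bad (62)/Why you procrastinate even when it feels b - TED-Ed.mp4
    2:11 But we're most likely to procrastinate tasks that evoke negative feelings,
    2:55 Because procrastination is motivated by our negative feelings,
    3:58 ongoing feelings of shame,
"""

hits = parse_lb_search(sample)
for title, captions in hits.items():
    print(title, len(captions))
```

The titles collected this way are what would then be fed back into the title search.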

Related to #140

cc @EMG70

Function to search through subtitles in xklb-metadata.db. The video titles returned as results are used to enhance Calibre-Web's simple search.
@deldesir deldesir added the enhancement New feature or request label Mar 28, 2024
@deldesir deldesir requested a review from holta March 28, 2024 14:58
@deldesir deldesir self-assigned this Mar 28, 2024
@deldesir deldesir changed the title Create lb_search.py Support searching videos subtitles Mar 28, 2024
@deldesir
Collaborator Author

deldesir commented May 2, 2024

This PR needs #140 to work. It's ready for merge.

cps/db.py
# the search_query function below only searches for books titles
result += self.search_query(term_part, config, *join).order_by(*order).all()
# we need to remove duplicates because the same book/video could be found multiple times
result = list(set(result))
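
A side note on the de-duplication step above: list(set(...)) does not keep the order that order_by(*order) produced. A minimal sketch of an order-preserving alternative follows. It uses plain tuples as stand-ins for query rows, and the helper name dedupe_keep_order is hypothetical.

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping first-seen order (items must be hashable)."""
    # dict preserves insertion order (Python 3.7+), so the first occurrence wins.
    return list(dict.fromkeys(items))


# Stand-in rows: (book_id, title) tuples in the order a query might return them.
rows = [(7, "B video"), (3, "A video"), (7, "B video"), (12, "C video")]
unique_rows = dedupe_keep_order(rows)
print(unique_rows)  # [(7, 'B video'), (3, 'A video'), (12, 'C video')]
```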
Member

@holta holta May 28, 2024

I'm asking myself if (a) user's search term/query (e.g. "feelings") and (b) lists of video/book titles whose subtitles contain that term/query ...can be more crisply/cleanly disambiguated.

@deldesir can you please clarify:

  • Is var result on Line 971 of cps/db.py a list of Calibre-Web book/video IDs — e.g. Calibre-Web's actual counting numbers like [1, 2, 3, 27] that appear in its web UI? i.e. Is that what function search_query outputs, do you know...?

    calibre-web/cps/db.py

    Lines 909 to 938 in a5486db

    def search_query(self, term, config, *join):
        term.strip().lower()
        self.session.connection().connection.connection.create_function("lower", 1, lcase)
        q = list()
        author_terms = re.split("[, ]+", term)
        for author_term in author_terms:
            q.append(Books.authors.any(func.lower(Authors.name).ilike("%" + author_term + "%")))
        query = self.generate_linked_query(config.config_read_column, Books)
        if len(join) == 6:
            query = query.outerjoin(join[0], join[1]).outerjoin(join[2]).outerjoin(join[3], join[4]).outerjoin(join[5])
        if len(join) == 3:
            query = query.outerjoin(join[0], join[1]).outerjoin(join[2])
        elif len(join) == 2:
            query = query.outerjoin(join[0], join[1])
        elif len(join) == 1:
            query = query.outerjoin(join[0])
        cc = self.get_cc_columns(config, filter_config_custom_read=True)
        filter_expression = [Books.tags.any(func.lower(Tags.name).ilike("%" + term + "%")),
                             Books.series.any(func.lower(Series.name).ilike("%" + term + "%")),
                             Books.authors.any(and_(*q)),
                             Books.publishers.any(func.lower(Publishers.name).ilike("%" + term + "%")),
                             func.lower(Books.title).ilike("%" + term + "%")]
        for c in cc:
            if c.datatype not in ["datetime", "rating", "bool", "int", "float"]:
                filter_expression.append(
                    getattr(Books,
                            'custom_column_' + str(c.id)).any(
                        func.lower(cc_classes[c.id].value).ilike("%" + term + "%")))
        return query.filter(self.common_filters(True)).filter(or_(*filter_expression))
  • Or, maybe it's a list of some equivalent book/video pointers within SQLite ?

(Please paste in an actual result sample, as an example will be extremely useful!)

Member

@holta holta May 29, 2024

@deldesir are all these variables that start with cc and cc_ about Calibre and/or Calibre-Web "custom columns" ?

(In function search_query and similar functions, within cps/db.py ?)

And if so, can we mostly ignore those for now?!

Member

@holta holta May 29, 2024

@deldesir inserted stub Line 961 below: (to log the value of variable result)

        result = self.search_query(term, config, *join).order_by(*order).all()
        log.debug("***Search results***: {}".format(result))

...yielding tail -f /var/log/calibre-web.log output:

[2024-05-29 09:54:46,674] INFO {cps.server:268} Starting Tornado server on :8083
[2024-05-29 09:55:01,271] DEBUG {cps.db:961} Search results: [(<Books('Top 5 MISTAKES Beginner Rides Make in TRAFFIC,Top 5 MISTAKES Beginner Rides Make in TRAFFICChaseontwowheels2024-05-29 13:07:08.6832822023-11-15 00:00:001.02024-05-29 13:07:08.683286Chaseontwowheels/Top 5 MISTAKES Beginner Rides Make in TRAFFIC (7)1')>, None, None)]

Member

@holta holta May 29, 2024

The (7) in the above output definitely appears to be the book/video ID, and other aspects of the result variable (tags, series, authors, publishers, ETC!) might be more understandable thanks to:

calibre-web/cps/db.py

Lines 363 to 390 in a5486db

class Books(Base):
    __tablename__ = 'books'

    DEFAULT_PUBDATE = datetime(101, 1, 1, 0, 0, 0, 0)  # ("0101-01-01 00:00:00+00:00")

    id = Column(Integer, primary_key=True, autoincrement=True)
    title = Column(String(collation='NOCASE'), nullable=False, default='Unknown')
    sort = Column(String(collation='NOCASE'))
    author_sort = Column(String(collation='NOCASE'))
    timestamp = Column(TIMESTAMP, default=datetime.utcnow)
    pubdate = Column(TIMESTAMP, default=DEFAULT_PUBDATE)
    series_index = Column(String, nullable=False, default="1.0")
    last_modified = Column(TIMESTAMP, default=datetime.utcnow)
    path = Column(String, default="", nullable=False)
    has_cover = Column(Integer, default=0)
    uuid = Column(String)
    isbn = Column(String(collation='NOCASE'), default="")
    flags = Column(Integer, nullable=False, default=1)

    authors = relationship(Authors, secondary=books_authors_link, backref='books')
    tags = relationship(Tags, secondary=books_tags_link, backref='books', order_by="Tags.name")
    comments = relationship(Comments, backref='books')
    data = relationship(Data, backref='books')
    series = relationship(Series, secondary=books_series_link, backref='books')
    ratings = relationship(Ratings, secondary=books_ratings_link, backref='books')
    languages = relationship(Languages, secondary=books_languages_link, backref='books')
    publishers = relationship(Publishers, secondary=books_publishers_link, backref='books')
    identifiers = relationship(Identifiers, backref='books')
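
Going by the logged output and the Books model above, each element of result looks like a tuple whose first item is a Books object, with the ID available as its id attribute rather than only inside the path. A minimal sketch of pulling IDs and titles out of rows shaped that way is below. FakeBook is a hypothetical stand-in for cps.db.Books so the snippet runs on its own, and the (Books, None, None) row shape is inferred only from the single log line in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeBook:
    """Hypothetical stand-in for cps.db.Books with only the fields used here."""
    id: int
    title: str
    path: str


def summarize(result):
    """Return (id, title) pairs from rows shaped like (Books, extra, extra)."""
    return [(row[0].id, row[0].title) for row in result]


result = [
    (
        FakeBook(
            id=7,
            title="Top 5 MISTAKES Beginner Rides Make in TRAFFIC",
            path="Chaseontwowheels/Top 5 MISTAKES Beginner Rides Make in TRAFFIC (7)",
        ),
        None,
        None,
    )
]
print(summarize(result))
```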

@holta holta changed the title Support searching videos subtitles Support searching videos [subtitles, captions, transcripts?] Jun 10, 2024
@deldesir deldesir closed this Jul 1, 2024
@deldesir deldesir deleted the deldesir-patch-26 branch July 1, 2024 21:39
@deldesir deldesir restored the deldesir-patch-26 branch July 1, 2024 22:01
@deldesir deldesir reopened this Jul 1, 2024
@deldesir
Collaborator Author

deldesir commented Aug 31, 2024

The search feature, which uses lb search under the hood, is fixed by the last three commits. It's ready for testing. PR #244 will make the same thing possible with a direct approach (querying xklb-metadata.db directly). A/B testing will allow for a better choice. In the future, we may consider moving captions/subtitles from /library/calibre-web/xklb-metadata.db to /library/calibre-web/metadata.db.
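
To illustrate what "querying the database directly" could look like, here is a minimal sketch using Python's sqlite3 module against an in-memory database. The media and captions tables and their columns are a hypothetical schema made up for this example. The real layout of xklb-metadata.db, and whatever PR #244 actually does, may differ.

```python
import sqlite3

# Hypothetical schema for illustration; the real xklb-metadata.db may differ.
SCHEMA = """
CREATE TABLE media (id INTEGER PRIMARY KEY, title TEXT, path TEXT);
CREATE TABLE captions (media_id INTEGER, time INTEGER, text TEXT);
"""


def search_captions(conn, term):
    """Return distinct video titles whose caption text contains `term`."""
    rows = conn.execute(
        """
        SELECT DISTINCT media.title
        FROM captions
        JOIN media ON media.id = captions.media_id
        WHERE captions.text LIKE '%' || ? || '%'
        ORDER BY media.title
        """,
        (term,),  # bound parameter, so the term is never spliced into the SQL
    ).fetchall()
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO media VALUES (1, 'Why you procrastinate even when it feels bad', 'a.mp4')"
)
conn.execute("INSERT INTO media VALUES (2, 'Another video', 'b.mp4')")
conn.executemany(
    "INSERT INTO captions VALUES (?, ?, ?)",
    [
        (1, 131, "But we're most likely to procrastinate tasks that evoke negative feelings,"),
        (1, 175, "Because procrastination is motivated by our negative feelings,"),
        (2, 10, "Nothing relevant here"),
    ],
)

titles = search_captions(conn, "feelings")
print(titles)
```

A direct query like this avoids starting an external process for every search, which may matter for the A/B comparison.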

@EMG70

EMG70 commented Sep 1, 2024

Ran sudo iiab-update -f on an existing VM
The search feature works OK in GUI and CLI as well.

Screenshot from 2024-09-01 13-56-41
Screenshot from 2024-09-01 14-04-25

@deldesir
Collaborator Author

deldesir commented Sep 1, 2024

Ran sudo iiab-update -f on an existing VM The search feature works OK in GUI and CLI as well.

Thanks @EMG70. Can you test with an actual word or term spoken in the video? This term should not be part of the title. The idea is to find the right videos by searching for something you heard in them. This should help when you don't remember which video a specific term came from.

@deldesir
Collaborator Author

deldesir commented Sep 1, 2024

@EMG70, it looks like per your screenshot you forgot to include this PR (#152) in your test. To be successful, your lb-wrapper search command should return something in the same structure as the one displayed in this PR description here.

cf. how to test a PR

@EMG70

EMG70 commented Sep 1, 2024

@EMG70, it looks like per your screenshot you forgot to include this PR (#152) in your test. To be successful, your lb-wrapper search command should return something in the same structure as the one displayed in this PR description here.

cf. how to test a PR

Sorry, I mistakenly thought that running iiab-update -f would include patch-26. I am redoing it now.

@EMG70

EMG70 commented Sep 1, 2024

A new VM was created with PR #152 applied.
sudo iiab-diagnostics output: https://paste.centos.org/view/5c92fac9

The word "nitrogen", which features in the video https://www.youtube.com/watch?v=j2vm9cq9l9Y&t=7s, was searched for using the GUI search, but it did not return any matches.

Screenshot from 2024-09-01 20-38-35

A second attempt was made in the Advanced Search box; this returned the correct result.

Screenshot from 2024-09-01 20-39-03

Screenshot from 2024-09-01 20-40-50

Although advanced search picked up the word nitrogen from the video, I would have expected to see the word nitrogen in the square brackets, but the page only says "1 search result for []" in the above screenshot.

Results for the search via CLI are below:
Screenshot from 2024-09-01 20-48-12

@deldesir
Collaborator Author

deldesir commented Sep 2, 2024

What's happening in your case is that your lb-wrapper is not updated. To fix this, run the following in your VM terminal:

 cp /usr/local/calibre-web-py3/scripts/lb-wrapper /usr/local/bin/lb-wrapper

then try searching again.

deldesir added a commit to deldesir/calibre-web that referenced this pull request Sep 2, 2024
@EMG70

EMG70 commented Sep 2, 2024

What's happening in your case is that your lb-wrapper is not updated. To fix this, run the following in your VM terminal:

 cp /usr/local/calibre-web-py3/scripts/lb-wrapper /usr/local/bin/lb-wrapper

then try searching again.

@EMG70

EMG70 commented Sep 2, 2024

All working now as you described after running cp /usr/local/calibre-web-py3/scripts/lb-wrapper /usr/local/bin/lb-wrapper

Screenshot from 2024-09-02 15-17-11

Screenshot from 2024-09-02 15-20-08

@holta
Member

holta commented Sep 3, 2024

@deldesir showed me in a VM that searching (using Calibre-Web interface) takes 2-3 seconds with PR #152, whereas it takes much less than 1 second with PR #244.

This is confusing, as an extremely simple Python external call to /usr/local/bin/lb-wrapper (i.e. xklb) should definitely not be taking 2-3 seconds?

Mysterious! 🙃
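
One way to narrow this down would be to time the external call on its own, outside Calibre-Web. Below is a minimal sketch using subprocess and time.perf_counter. The helper name time_command is hypothetical, and the command being timed is a stand-in (the Python interpreter doing nothing) so the snippet runs anywhere. On a test box it could be swapped for the lb-wrapper search command shown in the PR description.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run `cmd` several times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is captured and discarded; only the elapsed time matters here.
        subprocess.run(cmd, capture_output=True, text=True, check=False)
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in command so the sketch runs anywhere; on a test box this could be
# replaced with ["lb-wrapper", "search", "feelings"].
timings = time_command([sys.executable, "-c", "pass"])
print(["{:.3f}s".format(t) for t in timings])
```

Comparing these numbers with the page load time would show whether the delay sits in the external process or elsewhere in the request.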

@avni
Member

avni commented Sep 4, 2024

@deldesir showed me in a VM that searching (using Calibre-Web interface) takes 2-3 seconds with PR #152, whereas it takes much less than 1 second with PR #244.

There seems to be a slight difference in speed between PR #152 and PR #244. It is hard to measure the difference quantitatively from the front end, but it is perceptible. Here are two screen recordings in case they help.

http://192.168.64.33 has PR #152 applied.
http://192.168.64.38 has PR #244 applied.

Screen.Recording.2024-09-04.at.1.04.34.AM.mov
Screen.Recording.2024-09-04.at.1.06.23.AM.mov

Labels
enhancement New feature or request
4 participants