Camofy html image sources #467

Merged
merged 6 commits into from
Jan 9, 2025
Changes from 5 commits
34 changes: 31 additions & 3 deletions app/helpers/markdown_helper.rb
@@ -1,16 +1,44 @@
 module MarkdownHelper
   include CamoHelper

-  def camofy(markdown)
-    return unless markdown
+  def camofy(text)
+    return unless text

-    markdown.gsub(markdown_img_regex) do
+    text = sub_markdown(text)
+    sub_html(text)
+  end
+
+  def sub_markdown(text)
+    text.gsub(markdown_img_regex) do
       "![#{Regexp.last_match(1)}](#{camo(Regexp.last_match(2))}#{Regexp.last_match(3)})"
     end
   end

+  def sub_html(text)
+    text.gsub(html_img_regex) do
+      preceding_src = Regexp.last_match(1)
+      quote_mark = Regexp.last_match(2)
+      url = Regexp.last_match(3)
+      ending = Regexp.last_match(4)
+      "<img#{preceding_src} src=#{quote_mark}#{camo(url)}#{quote_mark}#{ending}"
+    end
+  end
+
   def markdown_img_regex
     # ![alt text](url =widthxheight "title")
     /!\[([^\[]*)\]\(([^\ )]+)(( =[^)]?[^ ]+)?( [^)]?"[^)]+")?)?\)/
   end
+
+  def html_img_regex
+    # warning: this regex may not be perfect. They rarely are.
+    # If you find an edge case, improve this regex!
+    # <img...something... src="url"...ending...
+    # or, the alternative quotes: <img...something... src='url'...ending...
+    # or, even without quotes: <img...something... src=url...ending...
+    # and the ...ending... can be either a space, a > or />
+    # note that we don't allow mismatched quotes like 'url" or shenanigans like that
+    # This regex contains two particularly useful features:
+    # capturing groups, and lazy matching.
+    %r{<img([^>]*) src=(["']?)(.+?)\2( |>|/>)}
Contributor Author

Don't ask me why I had to put %r{} around it; I just followed RuboCop's instructions.

Contributor

Is there any documentation online that I can use as a guideline for the review?

Contributor Author

If you've never seen regexes before, they are a pain in the ass to review. They are kind of like a language in their own right, and the main way to understand regexes is to play around building them and testing them out on a site that shows you what the regex matches.
I've taken the regex here and prepared an example for you: https://regex101.com/r/RLd8UL/1
If you'd like to have a quick online meeting about how to understand this, let me know. I have time tomorrow

Contributor

I have seen regex before, and it is almost unreadable. Thank you for the resource.

Contributor Author

The best way to test the regex is probably to come up with valid HTML img tags that are not matched by the regex. The edge cases I could think of have been captured in this one.
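
One way to run that kind of check is a small script that feeds candidate tags to the regex and reports which ones slip through. This is only a sketch: the constant names are made up for illustration, and the candidates are a few forms that HTML allows (uppercase names, whitespace around `=`, a newline before `src`, a `>` inside a quoted attribute), not an exhaustive list.

```ruby
# The regex as it appears in this diff (copied verbatim).
HTML_IMG_REGEX = %r{<img([^>]*) src=(["']?)(.+?)\2( |>|/>)}

# Hypothetical candidate tags; the first is the baseline from the spec,
# the rest are forms that HTML permits but that the pattern may not cover.
CANDIDATES = [
  '<img src="http://example.org/image.jpg">',
  '<IMG SRC="http://example.org/image.jpg">',             # uppercase names
  '<img src = "http://example.org/image.jpg">',           # whitespace around =
  "<img\nsrc=\"http://example.org/image.jpg\">",          # newline instead of a space
  '<img alt="a > b" src="http://example.org/image.jpg">'  # > inside an earlier attribute
].freeze

CANDIDATES.each do |tag|
  status = HTML_IMG_REGEX.match?(tag) ? 'matched' : 'NOT matched'
  puts "#{status}: #{tag.inspect}"
end
```

With the regex above, only the first candidate is matched; whether the others are worth supporting is a judgment call for the project.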

Contributor

I am cheating a little bit by using ChatGPT; it has some suggestions and explanations. I will check them and comment.

Contributor

@lodewiges lodewiges Nov 21, 2024

Suggested change
-    %r{<img([^>]*) src=(["']?)(.+?)\2( |>|/>)}
+    %r{<img([^>])*\s+src=(["']?)([^'">]+)\1(?=[/>])}

This one also checks for whitespace between the elements and has better checking for the image URL.

Contributor Author

@DrumsnChocolate DrumsnChocolate Nov 21, 2024

You seem to have removed the first capturing group. ChatGPT has that tendency because it doesn't see you using its result anywhere in your regex. However, we definitely use this capturing group to produce the new text. What I do like is the use of \s; that's any whitespace, right?

What exactly does the latter part improve? Because I don't think it really improves anything. Let me break my thoughts down:

  • original: (.+?)\2( |>|/>), where \2 matches the same value as the captured opening quote (either ', " or nothing at all). (.+?) lazily matches any character. That means the match begins with src=' or src=" or src= and then continues over any characters until the first point where that same quotation mark appears again, followed by one of the three options in ( |>|/>): a space, > or />.
  • your suggestion: ([^'">]+)\1(?=[/>]), where \1 matches the same value as the captured opening quote. Note that [^'">] is unnecessarily restrictive: I don't know the HTML specification exactly, but I can imagine that something like src='lookatthisdoublequote"isntitamazing' is valid. Notice the " in the middle? Your suggested regex would stop the src value at that middle quote, because it does not pass the [^'">] check. And the latter part, (?=[/>]), does not allow for any whitespace.
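
The difference between the two URL groups can be checked with a short script. This is a sketch under assumptions: `ORIGINAL` is the regex from the diff, while `RESTRICTIVE` is a hypothetical variant built only for comparison, keeping the original's structure and swapping in the `[^'">]+` class from the suggestion.

```ruby
# Regex from the diff: group 2 is the opening quote (possibly empty),
# group 3 lazily captures the URL, and \2 requires the same quote to close it.
ORIGINAL = %r{<img([^>]*) src=(["']?)(.+?)\2( |>|/>)}

# Hypothetical comparison variant: identical, except that the URL group
# rejects any quote character or > inside the value.
RESTRICTIVE = %r{<img([^>]*) src=(["']?)([^'">]+)\2( |>|/>)}

tag = %q(<img alt="x" src='lookatthisdoublequote"isntitamazing'>)

original_match = ORIGINAL.match(tag)
puts "original URL group:    #{original_match[3]}"

restrictive_match = RESTRICTIVE.match(tag)
puts "restrictive match:     #{restrictive_match.inspect}"

# The unquoted form still works with the lazy group, because \2 is empty
# and the ending group decides where the URL stops.
unquoted = ORIGINAL.match('<img src=http://example.org/image.jpg>')
puts "unquoted URL group:    #{unquoted[3]}"
```

The lazy group accepts the double quote inside the single-quoted value, while the restrictive class finds no match at all for that tag.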

I do have some ideas based on your suggestion:

  • use \s instead of spaces when I'm indicating whitespace
  • I've reconsidered whether the last capturing group is necessary, and yes, it is: when we have src=somesource, we can only determine the end of the source value when we encounter whitespace, / or >. But I can change it from ( |>|/>) to ( |>|/), which can be written more simply as [ >/]. However, if I want to match \s instead of a space, I can't use the [] notation, I think
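
On the last point, Ruby's regex engine does accept shorthand classes such as `\s` inside brackets, so `[\s>/]` is a valid way to say "whitespace, > or /". The sketch below illustrates this; `REVISED` is a hypothetical combination of the ideas listed above and is not necessarily the pattern that ended up in the latest commit.

```ruby
# \s inside a bracket expression: one whitespace character, > or /.
ENDING_CLASS = %r{[\s>/]}

[' ', "\t", "\n", '>', '/', 'a'].each do |ch|
  puts "#{ch.inspect} accepted: #{ENDING_CLASS.match?(ch)}"
end

# Hypothetical revision (an assumption, not taken from the PR): \s+ before src,
# and the bracket expression as the fourth capturing group.
REVISED = %r{<img([^>]*)\s+src=(["']?)(.+?)\2([\s>/])}

multiline_tag = "<img\n  src=\"http://example.org/image.jpg\"\n/>"
revised_match = REVISED.match(multiline_tag)
puts revised_match[3]
```

With this variant, a tag that puts `src` on its own line is matched as well, which the space-only version would miss.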

Contributor Author

I've applied the new ideas I mentioned in the latest commit

Contributor

This part was my bad: "(?=[/>]) does not allow for any whitespace". I asked ChatGPT to remove the whitespace.
I was unaware that "src='lookatthisdoublequote"isntitamazing'" is a valid source, but I agree that the suggestion is too restrictive.

+  end
 end
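
To see why all four capturing groups are needed when rebuilding the tag, here is a standalone sketch of the substitution that runs outside Rails. `fake_camo` and `sub_html_sketch` are hypothetical names: the real `camo` comes from `CamoHelper`, which is not part of this diff, and judging by the spec it also prefixes a digest that this stub does not attempt to reproduce.

```ruby
HTML_IMG = %r{<img([^>]*) src=(["']?)(.+?)\2( |>|/>)}

# Hypothetical stand-in for CamoHelper#camo: hex-encodes the URL only.
def fake_camo(url)
  "https://camo.invalid/#{url.unpack1('H*')}"
end

def sub_html_sketch(text)
  text.gsub(HTML_IMG) do
    m = Regexp.last_match
    # m[1]: attributes before src, m[2]: quote mark (may be empty),
    # m[3]: the URL, m[4]: the character(s) that ended the match.
    "<img#{m[1]} src=#{m[2]}#{fake_camo(m[3])}#{m[2]}#{m[4]}"
  end
end

rewritten = sub_html_sketch(%q(<img alt="" src='http://example.org/image.jpg' />))
puts rewritten
```

Dropping group 1 would lose the attributes that precede `src`, and dropping group 4 would swallow the space or `>` that terminated the match.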
37 changes: 37 additions & 0 deletions spec/helpers/markdown_helper_spec.rb
@@ -29,6 +29,43 @@
         '![](https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067 =100x* "Image title")'
       )
     end
+
+    it do
+      expect(camofy('<img src="http://example.org/image.jpg">')).to eq(
+        '<img src="https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067">'
+      )
+    end
+
+    it do
+      expect(camofy("<img src='http://example.org/image.jpg'>")).to eq(
+        "<img src='https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067'>"
+      )
+    end
+
+    it do
+      expect(camofy('<img src="http://example.org/image.jpg"/>')).to eq(
+        '<img src="https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067"/>'
+      )
+    end
+
+    it do
+      expect(camofy('<img src="http://example.org/image.jpg" />')).to eq(
+        '<img src="https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067" />'
+      )
+    end
+
+    it do
+      expect(camofy('<img src="http://example.org/image.jpg" style="somekindofstyle" >')).to eq(
+        '<img src="https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067" style="somekindofstyle" >'
+      )
+    end
+
+    it do
+      expect(camofy('<img alt="" style="somekindofstyle" src="http://example.org/image.jpg">')).to eq(
+        '<img alt="" style="somekindofstyle" src="https://example.org/c7125941763fc18c9d8977ed19028ca5f9378070/687474703a2f2f6578616d706c652e6f72672f696d6167652e6a7067">'
+      )
+    end
+
     # rubocop:enable Layout/LineLength
   end
 end