
Support for md ref links, others #6

Merged · 7 commits · Jul 1, 2021
Changes from 4 commits
43 changes: 28 additions & 15 deletions src/scrapeLinks.test.js
@@ -1,47 +1,60 @@
import scrapeLinks from "./scrapeLinks";

const testMarkdownString = `
This is a Markdown example with [a link to google](https://www.google.com) and [one with a subdirectory](https://www.google.com/nested/page.html)
const markdownString = `
Markdown example with a [link to Google](https://www.google.com), one [with a URL path](https://www.google.com/nested/page.html), and others:

and [another to reddit](www.reddit.com) and [a third to Twitter](facebook.com)
- One [to reddit](www.reddit.com)
- A fourth [to Facebook](facebook.com) (incomplete URLs)
- Finally a few [ref] [links][links] [here][link-here]

as well as some blank lines
[ref]: https://www.ref.com
[links]:
www.links.in/newline
[link-here]:
/just/a/path
Comment on lines +11 to +15
@jorgeorpinel (Contributor, Author) · Jun 30, 2021:

Test this...


There's also some blank lines, misc. text, and <span>HTML</span> code.
`;

const plaintextString = `
This string is plaintext, with links like https://www.google.com and https://www.google.com/nested/page.html

I can scrape "https://reddit.com/r/subreddit" and (https://facebook.com) as well!
I can scrape "https://reddit.com/r/subreddit" and (https://facebook.com) as well! The new regex can pull www.youtube.com too!?

The new regex can pull www.youtube.com too!? unfortunately, gmail.com is just too vague.
TODO: Unfortunately, gmail.com is just too vague.
TODO: Ending in a period won't work well either, e.g. www.something.com.
`;

const plaintextTestResult = [
const markdownTestResult = [
"https://www.google.com",
"https://www.google.com/nested/page.html",
"https://reddit.com/r/subreddit",
"https://facebook.com",
"www.youtube.com",
"www.reddit.com",
"facebook.com",
"https://www.ref.com",
"www.links.in/newline",
"/just/a/path",
Comment on lines +34 to +36
@jorgeorpinel (Contributor, Author) · Jun 30, 2021:

Vs. this

@jorgeorpinel (Contributor, Author) · Jun 30, 2021:

Hmmm BTW jest is failing with

    - Expected  - 2
    + Received  + 0
...
    -   "www.links.in/newline",
    -   "/just/a/path",

But if I remove these 2 lines then it fails with

    - Expected  - 0
    + Received  + 2
...
    +   "www.links.in/newline",
    +   "/just/a/path",

🤔 Yelp

A Contributor replied:

I'm pretty sure you're looking at two different tests, one that splits the string into newlines and one that doesn't.

I just removed the newline-split one and changed the inner workings so we don't get string arrays as content (they're joined into a single string before being returned). That should solve this weirdness and allow us to pick up links that span multiple lines like this.
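A minimal sketch of the join-before-scrape idea (hypothetical standalone code; the repo's own `regexMap` helper is internal, so this uses `String.prototype.matchAll` directly with the ref-link pattern from this PR's diff):

```javascript
// A ref link definition whose URL wraps onto the next line, as the
// line-splitting parser would previously see it: one string per line.
const lines = ["[links]:", "www.links.in/newline"];
const pattern = /\[.*\]:\s*\n?\s*(.*)/gm;

// Scraping line-by-line misses the URL: the pattern can't see past the break,
// so the "[links]:" line yields an empty capture and the URL line matches nothing.
const perLine = lines.flatMap((l) => [...l.matchAll(pattern)].map((m) => m[1]));
// perLine is [""]: the URL is lost

// Joining the lines first lets the whitespace part of the pattern bridge them.
const joined = [...lines.join("\n").matchAll(pattern)].map((m) => m[1]);
// joined is ["www.links.in/newline"]
```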

@jorgeorpinel (Contributor, Author) · Jun 30, 2021:

Right, I thought it had something to do with the newline, even though I put \n in my regex. I thought maybe it was separately matching those links as plain links, which is why I added other plain links in the sample (#6 (review) below).

TBH I'm not 100% sure what the changes to the other files do, but the tests are passing, so thanks! Do you approve this PR now? 🙂

A Contributor replied:

> TBH I'm not 100% sure what the changes to the other files do

All just internal stuff. Basically, this project has two ways of gathering links: from a file and from a GH diff. The GH diff method used to return an array of strings, since I had assumed no links span multiple lines, but that's clearly not the case; now that parser joins the array into a single string before returning it, similarly to the filesystem-based parser.

A Contributor replied:

> Do you approve this PR now?

Sorry for the delay! I'm trying this PR out manually by running it on the CLI, and noticed some things:

  1. dvc.org's prettier rules are butchering these ref links, forcing them onto the same line
  2. when forced with --no-verify, these links aren't picked up

I tested it manually by running yarn build, going to a local dvc.org directory, and adding the MD test string to a random MD file. The links were found when prettier forced them onto the same line, but multi-line ones weren't found even after all the changes.

@rogermparent (Contributor) · Jul 1, 2021:

Just found a change that sort of fixes it (removing + from the start of the git diff lines), but for some reason /just/a/path is still not found. I forgot: local links aren't checked locally! I think after this small patch we should be good to go.
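That fix can be illustrated with a small standalone sketch (hypothetical code; the variable names are made up, and only the ref-link regex comes from this PR's diff):

```javascript
// Two changed lines as they come out of a git diff, each prefixed with "+".
const diffLines = ["+[links]:", "+www.links.in/newline"];
const pattern = /\[.*\]:\s*\n?\s*(.*)/gm;

// Joined as-is, the "+" marker ends up inside the captured URL.
const raw = [...diffLines.join("\n").matchAll(pattern)].map((m) => m[1]);
// raw[0] is "+www.links.in/newline": not a valid link

// Stripping the leading "+" from each diff line first yields the clean URL.
const stripped = diffLines.map((l) => l.replace(/^\+/, ""));
const cleaned = [...stripped.join("\n").matchAll(pattern)].map((m) => m[1]);
// cleaned[0] is "www.links.in/newline"
```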

A Contributor replied:

Patch is in! I can deploy tomorrow when I'm a bit less tired; I don't want to rush and mess it up. However, merging in the source repo won't break anything in prod, so I can approve/merge. Thanks @jorgeorpinel!

@jorgeorpinel (Contributor, Author):

Hey, thank you so much @rogermparent! I see you did some more tweaks to address some of the things you mentioned. Let's see what it does on dvc.org once deployed 😬

];

const markdownTestResult = [
const plaintextTestResult = [
"https://www.google.com",
"https://www.google.com/nested/page.html",
"www.reddit.com",
"facebook.com",
"https://reddit.com/r/subreddit",
"https://facebook.com",
"www.youtube.com",
"www.something.com.",
];

test("It scrapes from the markdown test string", () => {
expect(
scrapeLinks({
filePath: "test.md",
content: testMarkdownString,
content: markdownString,
})
).toEqual(markdownTestResult);
});

test("It scrapes from the markdown test split by newlines", () => {
const splitTest = testMarkdownString.split("\n");
const splitTest = markdownString.split("\n");
expect(
scrapeLinks({
filePath: "test.md",
7 changes: 6 additions & 1 deletion src/scrapeLinks.ts
@@ -27,8 +27,13 @@ const scrapeFromString: (filePath: string, content: string) => string[] = (
/\[.*?\]\((?:<((?:\(.*?\)|.)*?)>|((?:\(.*?\)|.)*?))(?: ["'].*?["'])?\)/gm,
(x) => x[2] || x[1]
);
const mdRefLinks = regexMap(
content,
/\[.*\]:\s*\n?\s*(.*)/gm,
(x) => x[2] || x[1]
);
const hrefLinks = regexMap(content, /href="(.*?)"/gm);
jorgeorpinel marked this conversation as resolved.
const links = mdLinks.concat(hrefLinks);
const links = mdLinks.concat(mdRefLinks).concat(hrefLinks);
return links
? links
.filter(Boolean)
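As a standalone check of the new ref-link pattern (hypothetical snippet; `regexMap` is a repo-internal helper, so this uses `String.prototype.matchAll` directly):

```javascript
// The three ref-link definitions from the markdown test string above.
const content = [
  "[ref]: https://www.ref.com",
  "[links]:",
  "www.links.in/newline",
  "[link-here]:",
  "/just/a/path",
].join("\n");

// Same pattern as in the diff: the \s*\n?\s* part lets a definition's URL
// sit either on the same line as "[label]:" or on the line after it.
const refLinks = [...content.matchAll(/\[.*\]:\s*\n?\s*(.*)/gm)].map(
  (m) => m[1]
);
// refLinks is ["https://www.ref.com", "www.links.in/newline", "/just/a/path"]
```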