From de9d5aa388275f4343236f4c6606da5e0faf7638 Mon Sep 17 00:00:00 2001
From: charlottejmc <143802849+charlottejmc@users.noreply.github.com>
Date: Fri, 4 Oct 2024 10:49:04 +0800
Subject: [PATCH 1/3] Update working-with-batches-of-pdf-files.md

Add -L redirect to the curl parameter
---
 en/lessons/working-with-batches-of-pdf-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/en/lessons/working-with-batches-of-pdf-files.md b/en/lessons/working-with-batches-of-pdf-files.md
index 6324a6a89..c2336374f 100644
--- a/en/lessons/working-with-batches-of-pdf-files.md
+++ b/en/lessons/working-with-batches-of-pdf-files.md
@@ -298,7 +298,7 @@ You can download the corpus from the [ILO website](https://www.ilo.org/public/li
 To automate this step you can use the following command line commands. This will download all English documents (340 files) at once.

 ``` bash
-curl https://www.ilo.org/public/libdoc/ilo/ILO-SR/ |
+curl -L https://www.ilo.org/public/libdoc/ilo/ILO-SR/ |
 grep -o 'ILO[^"]*engl[^"><\/]*' |
 uniq |
 sed 's,ILO,https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO,g' > list_of_files.txt

From 4e84c24006c7b396aca68cad09f51b44bf6ea1f9 Mon Sep 17 00:00:00 2001
From: charlottejmc <143802849+charlottejmc@users.noreply.github.com>
Date: Wed, 9 Oct 2024 14:19:25 +0800
Subject: [PATCH 2/3] Update working-with-batches-of-pdf-files.md

Update ILO links to the new webapps redirect
---
 en/lessons/working-with-batches-of-pdf-files.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/en/lessons/working-with-batches-of-pdf-files.md b/en/lessons/working-with-batches-of-pdf-files.md
index c2336374f..1fc505269 100644
--- a/en/lessons/working-with-batches-of-pdf-files.md
+++ b/en/lessons/working-with-batches-of-pdf-files.md
@@ -138,15 +138,15 @@ Throughout the lesson I will assume that 'proghist' is your working directory.

 Save all files below to your working directory:

- - [Classification of industries](https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N1_engl.pdf)
- - [Statistics of wages and hours of labour](https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N2_engl.pdf)
- - [Statistics of industrial accidents](https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N3_engl.pdf)
- - [Report of the Conference](https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N4_engl.pdf)
- - [International labour review](https://www.ilo.org/public/libdoc/ilo/P/09602/09602(1924-9-1)3-30.pdf)
+ - [Classification of industries](https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N1_engl.pdf)
+ - [Statistics of wages and hours of labour](https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N2_engl.pdf)
+ - [Statistics of industrial accidents](https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N3_engl.pdf)
+ - [Report of the Conference](https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ILO-SR_N4_engl.pdf)
+ - [International labour review](https://webapps.ilo.org/public/libdoc/ilo/P/09602/09602(1924-9-1)3-30.pdf)

 To illustrate image extraction and PDF merging you will include one more files to our corpus that is not directly related to the First International Conference of Labour Statisticians from 1923.

- - [Speeches made at the ceremony on 21 October 1923](https://www.ilo.org/public/libdoc/ilo/1923/23B09_5_engl.pdf)
+ - [Speeches made at the ceremony on 21 October 1923](https://webapps.ilo.org/public/libdoc/ilo/1923/23B09_5_engl.pdf)

 For the Topic Modelling of the [case study](#use-topic-modelling-to-analyze-the-corpus) you will download more files later in the lesson.

@@ -293,12 +293,12 @@ mkdir case_study
 cd case_study
 ```

-You can download the corpus from the [ILO website](https://www.ilo.org/public/libdoc/ilo/ILO-SR/). All English documents contain ‘engl’ in the title. It’s over a gigabyte of data. Depending on your internet speed this may take a while.
+You can download the corpus from the [ILO website](https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/). All English documents contain ‘engl’ in the title. It’s over a gigabyte of data. Depending on your internet speed this may take a while.

 To automate this step you can use the following command line commands. This will download all English documents (340 files) at once.

 ``` bash
-curl -L https://www.ilo.org/public/libdoc/ilo/ILO-SR/ |
+curl https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ |
 grep -o 'ILO[^"]*engl[^"><\/]*' |
 uniq |
 sed 's,ILO,https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO,g' > list_of_files.txt

From 27c915f071cf7d20dfb19962e70f7d831c01dda2 Mon Sep 17 00:00:00 2001
From: charlottejmc <143802849+charlottejmc@users.noreply.github.com>
Date: Thu, 10 Oct 2024 16:28:39 +0800
Subject: [PATCH 3/3] Update working-with-batches-of-pdf-files.md

Update the link in the `sed` command
---
 en/lessons/working-with-batches-of-pdf-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/en/lessons/working-with-batches-of-pdf-files.md b/en/lessons/working-with-batches-of-pdf-files.md
index 1fc505269..fa9c9d3c9 100644
--- a/en/lessons/working-with-batches-of-pdf-files.md
+++ b/en/lessons/working-with-batches-of-pdf-files.md
@@ -301,7 +301,7 @@ To automate this step you can use the following command line commands. This will
 curl https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ |
 grep -o 'ILO[^"]*engl[^"><\/]*' |
 uniq |
-sed 's,ILO,https://www.ilo.org/public/libdoc/ilo/ILO-SR/ILO,g' > list_of_files.txt
+sed 's,ILO,https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/ILO,g' > list_of_files.txt
 xargs -n 1 curl -O < list_of_files.txt
 rm list_of_files.txt
 ```
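
Review note: the following is a minimal offline sketch of what the `grep | uniq | sed` stage of the patched command does, so the URL change in patch 3/3 can be checked without downloading the corpus. The `sample_html` lines are a hypothetical stand-in for the directory listing (the real markup is not shown in these patches and may differ), and `sample_html`, `base_url`, and `url_list` are illustrative names that do not appear in the lesson.

```bash
#!/usr/bin/env bash
# Hypothetical excerpt of a directory listing; the real page markup is an assumption.
sample_html='<a href="ILO-SR_N1_engl.pdf">ILO-SR_N1_engl.pdf</a>
<a href="ILO-SR_N1_fren.pdf">ILO-SR_N1_fren.pdf</a>
<a href="ILO-SR_N2_engl.pdf">ILO-SR_N2_engl.pdf</a>'

# Base URL as it appears in the sed expression after patch 3/3.
base_url='https://webapps.ilo.org/public/libdoc/ilo/ILO-SR/'

# Same three stages as the lesson's pipeline, fed from the sample instead of curl:
#   grep -o  prints only the matched file names that contain "engl"
#   uniq     collapses identical adjacent matches
#   sed      prefixes each file name with the base URL
url_list=$(printf '%s\n' "$sample_html" |
  grep -o 'ILO[^"]*engl[^"><\/]*' |
  uniq |
  sed "s,ILO,${base_url}ILO,g")

printf '%s\n' "$url_list"
```

In this sample each link produces two identical adjacent matches (the `href` value and the link text), which is presumably why `uniq` sits between `grep` and `sed`. The download step (`xargs -n 1 curl -O`) is deliberately left out so the sketch makes no network requests.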