---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# lang
<!-- badges: start -->
[![R-CMD-check](https://github.com/mlverse/lang/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/mlverse/lang/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/mlverse/lang/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/lang?branch=main)
<!-- badges: end -->
Use an **LLM to translate a function's help documentation on-the-fly**. `lang`
overrides the `?` and `help()` functions in your R session. If you are using
RStudio or Positron, the translated help page will appear in the usual help
pane.

If you are a package developer, `lang` helps you translate your documentation
and include it as part of your package. `lang` will use the same `?` override
to display your translated help documents.
## Installation
To install the GitHub version of `lang`, use:
```r
install.packages("pak")
pak::pak("mlverse/lang")
```
## Using `lang`
If you have not used `mall` yet, then the first step is to set it up. Feel free
to follow the instructions in that package's
[Get Started](https://mlverse.github.io/mall/#get-started) page. Setting up
your LLM and `mall` should be a one-time process.

In an everyday R session, you only need to load `lang` and then tell
it which model to run using `llm_use()`:
```r
library(lang)
llm_use("ollama", "llama3.2", seed = 100)
```
After that, simply use `?` to trigger and display the translated documentation.
During translation, `lang` displays its progress by showing which section
of the documentation is currently being translated:
```r
> ?lm
Translating: Title
```
If your environment is set to use the Spanish language, the help pane should
display this:
<img src="man/figures/lm-spanish.png" align="center"
alt="Screenshot of the lm function's help page in Spanish"/>
R enforces the printed name of each section, so section titles such as
Description, Usage and Arguments cannot be translated and will always remain
untranslated.
### How it works
The language that the help documentation will be translated to is determined by
one of the following two environment variables. In order of priority, the
variables are:
1. `LANGUAGE`
1. `LANG`
It is likely that your `LANG` variable already defaults to your locale.
For example, mine is set to `en_US.UTF-8` (English, United States).
For someone in France, the locale would be something such as `fr_FR.UTF-8`.
Llama3.2 recognizes these locale codes, so with `lang` loaded, calling `?` will
translate the function's help documentation into French.
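
The priority order described above can be sketched in a few lines of base R.
The helper name `resolve_language_var()` is hypothetical and is not part of
`lang`; it only illustrates checking `LANGUAGE` before `LANG`:

```r
# Hypothetical helper, not part of lang: returns the first non-empty value
# among LANGUAGE and LANG, in that order of priority.
resolve_language_var <- function(env = Sys.getenv(c("LANGUAGE", "LANG"))) {
  non_empty <- env[nzchar(env)]
  if (length(non_empty) == 0) {
    # Neither variable is set
    return(NA_character_)
  }
  unname(non_empty[[1]])
}

# Uses the variables of the current session
resolve_language_var()
```

Passing a named vector, such as `c(LANGUAGE = "", LANG = "fr_FR.UTF-8")`, makes
it easy to try the helper without changing your session.
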
`lang` uses the `mall` package as the integration point with the LLM. Under the
hood, it runs `llm_vec_translate()` multiple times to translate the most common
sections of the help documentation (e.g., Title, Description, Details,
Arguments). If `lang` determines that your environment is set to use
English, it will simply display the original documentation.
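
As a rough illustration of that section-by-section flow, here is a minimal
sketch. The names `is_english()`, `translate_sections()` and `fake_translate()`
are hypothetical and do not come from `lang` or `mall`; the stand-in translator
only tags the text so that the example runs without an LLM:

```r
# Hypothetical check: treats "english", "en" and "en_..." locales as English
is_english <- function(language) {
  language <- tolower(language)
  language == "english" || grepl("^en(_|$)", language)
}

# Hypothetical sketch: translates each named section with `translate_fn`,
# or returns the sections unchanged when the target language is English
translate_sections <- function(sections, language, translate_fn) {
  if (is_english(language)) {
    return(sections)
  }
  translated <- lapply(names(sections), function(section_name) {
    message("Translating: ", section_name)
    translate_fn(sections[[section_name]], language)
  })
  names(translated) <- names(sections)
  translated
}

# Stand-in for a real LLM call, it only prefixes the text with the language
fake_translate <- function(x, language) paste0("[", language, "] ", x)

help_sections <- list(
  Title = "Fitting Linear Models",
  Description = "lm is used to fit linear models."
)

translate_sections(help_sections, "spanish", fake_translate)
```

With a configured backend, `mall::llm_vec_translate` could presumably be passed
as `translate_fn`, assuming its first two arguments are the text and the target
language.
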
### Considerations
#### Translation is not perfect
As you can imagine, the quality of translation will mostly depend on the LLM
being used. This solution is meant to be as helpful as possible, while
acknowledging that, at this stage of LLMs, a human-curated translation
remains the best solution. Having said that, I believe that even an imperfect
translation could go a long way for someone who is struggling to understand
how to use a specific function in a package, and who may also struggle with the
English language.
#### Debug
If the original English help page displays, check your environment variables:
```{r}
Sys.getenv("LANG")
Sys.getenv("LANGUAGE")
```
In my case, `lang` recognizes that the environment is set to English because
of the `en` code in the variable. If your `LANG` variable is set to `en_...`,
then no translation will occur.

If this is your case, set the `LANGUAGE` variable to your preference. You can
use the full language name, such as 'spanish' or 'french'. You can use
`Sys.setenv(LANGUAGE = "[my language]")` or, for a more permanent solution,
add the entry to your .Renviron file (`usethis::edit_r_environ()`).
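
If you only want to try a language for a single call, without changing the rest
of your session, a small wrapper can set `LANGUAGE` and then restore it. This is
a sketch in base R; `with_language()` is a hypothetical name and is not part of
`lang`:

```r
# Hypothetical helper, not part of lang: sets LANGUAGE while `code` runs,
# then restores the previous value (or unsets it if it was not set before)
with_language <- function(language, code) {
  old <- Sys.getenv("LANGUAGE", unset = NA)
  on.exit(
    if (is.na(old)) Sys.unsetenv("LANGUAGE") else Sys.setenv(LANGUAGE = old),
    add = TRUE
  )
  Sys.setenv(LANGUAGE = language)
  # `code` is evaluated lazily, so it runs after LANGUAGE has been set
  force(code)
}

# The variable is only changed while the expression is evaluated
with_language("spanish", Sys.getenv("LANGUAGE"))
```
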
## Package Developers
You may want to provide translations of your documentation as part of your
package. `lang` includes an entire infrastructure to help you do the following:
- Let the LLM take the first pass at translating your documentation
- Easily edit the translations, so that you or a collaborator can fine-tune
the new files
- Include the translated Rd files as part of your package
- Have `?` and `help()` pull from your translated documents
### LLM First pass
While inside your package's project, use `translate_roxygen()` to have `lang`
translate all of your documentation to the desired language. The function call
must include the target language, and the sub-folder to save the translated
files to:
```r
translate_roxygen("spanish", "es")
```
That function call will iterate through your **'R/'** folder and translate all of
your [`roxygen2`](https://roxygen2.r-lib.org/index.html) documentation. The
new Roxygen documents will be saved, by default, to a new **'man-lang/'** folder.
Make sure to add the new folder to your project's **'.Rbuildignore'** file
(`^man-lang$`).
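
One way to add that entry without opening the file is sketched below in base R.
The function name `add_build_ignore()` is hypothetical and is not part of
`lang`; it appends the pattern only if it is not already listed:

```r
# Hypothetical helper, not part of lang: appends a pattern to .Rbuildignore
# unless the exact same line is already present
add_build_ignore <- function(pattern = "^man-lang$", path = ".Rbuildignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  if (!pattern %in% existing) {
    writeLines(c(existing, pattern), path)
  }
  invisible(path)
}

# Demonstration against a temporary file instead of a real package
tmp_ignore <- tempfile()
add_build_ignore(path = tmp_ignore)
readLines(tmp_ignore)
```

If you already use `usethis`, calling `usethis::use_build_ignore("man-lang")`
from the package project should produce the same entry.
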
**ISO 639 codes** - The name of the sub-folder needs to be the two-letter
designation of the target language you are using. That is why we used **es** for
Spanish. For the list of codes, you can refer to the
[Wikipedia page here](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes).
If you do not pass the `lang_sub_folder` argument, then `lang` will use the
`to_iso639()` function to automatically convert the value of `lang` to a
valid two-character language code.

For this package, making that function call creates this console output:
```r
> translate_roxygen("spanish")
✔ 'spanish' converted to ISO 639 code: 'es'
ℹ Loading lang
[1/9] R/help-shims.R --> man-lang/es/help-shims.R
[2/9] R/iso-639.R --> man-lang/es/iso-639.R
[3/9] R/lang-help.R --> man-lang/es/lang-help.R
[4/9] R/lang.R --> [Skipping, no Roxygen content found]
[5/9] R/mall-reexports.R --> man-lang/es/mall-reexports.R
[6/9] R/process-roxygen.R --> man-lang/es/process-roxygen.R
[7/9] R/roxy-comments.R --> [Skipping, no Roxygen content found]
[8/9] R/translate-roxygen.R --> man-lang/es/translate-roxygen.R
[9/9] R/utils.R --> [Skipping, no Roxygen content found]
```
`lang` ties the resulting translated R scripts to the source R scripts by
adding a copy of the original Roxygen documentation. This way, it avoids
re-translating the content if nothing has changed:
```r
> translate_roxygen("spanish")
✔ 'spanish' converted to ISO 639 code: 'es'
ℹ Loading lang
[1/9] R/help-shims.R --> [Skipping, no changes detected]
[2/9] R/iso-639.R --> [Skipping, no changes detected]
[3/9] R/lang-help.R --> [Skipping, no changes detected]
[4/9] R/lang.R --> [Skipping, no Roxygen content found]
[5/9] R/mall-reexports.R --> [Skipping, no changes detected]
[6/9] R/process-roxygen.R --> [Skipping, no changes detected]
[7/9] R/roxy-comments.R --> [Skipping, no Roxygen content found]
[8/9] R/translate-roxygen.R --> [Skipping, no changes detected]
[9/9] R/utils.R --> [Skipping, no Roxygen content found]
```
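
The skip logic shown in the console output above can be pictured as a comparison
between the Roxygen lines currently in the source script and the copy saved at
translation time. The sketch below is illustrative only: `extract_roxygen()` and
`needs_translation()` are hypothetical names, and how `lang` actually stores and
compares the copy is not described here:

```r
# Hypothetical helper: keeps only the Roxygen comment lines of a script
extract_roxygen <- function(lines) {
  roxygen_lines <- lines[grepl("^\\s*#'", lines)]
  sub("^\\s+", "", roxygen_lines)
}

# Hypothetical helper: TRUE when the current Roxygen content differs from
# the copy that was stored when the translation was made
needs_translation <- function(source_lines, stored_copy) {
  !identical(extract_roxygen(source_lines), stored_copy)
}

source_lines <- c(
  "#' Translates help",
  "#' @param topic The topic to search for",
  "lang_help <- function(topic) NULL"
)
stored_copy <- c(
  "#' Translates help",
  "#' @param topic The topic to search for"
)

# Nothing changed, so this file would be skipped
needs_translation(source_lines, stored_copy)
```

Changes to the function body alone do not trigger a new translation in this
sketch, because only the `#'` lines are compared.
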
### Edit the translations
As mentioned in the previous section, `lang` translates the functions'
Roxygen comments. This approach allows you as the developer to easily edit the
output.
For the `lang_help()` function, in the **'R/lang-help.R'** script, the top of
the documentation looks like this:
```r
#' Translates help
#' @description
#' Translates a given topic into a target language. It uses the `lang` argument
#' to determine which language to translate to. If not passed, this function will
#' look for a target language in the LANG and LANGUAGE environment variables to
#' determine the target language. If the target language is English, no translation
#' will be processed, so the help returned will be the original package's
#' documentation.
#'
#' @param topic The topic to search for
#' @param package The R package to look for the topic
#' @param lang Language to translate the help to
#' @param type Produce "html" or "text" output for the help. It default to
#' `getOption("help_type")`
...
```
And this is what the translation in **'man-lang/es/lang-help.R'** looks like:
```r
#' Ayuda en traducción
#' @description La función traduce un tema dado a un idioma objetivo. Utiliza
#' el argumento `lang` para determinar qué idioma traducir. Si no se pasa, esta
#' función busca un idioma objetivo en las variables de entorno LANG y LANGUAGE
#' para determinarlo. Si el idioma objetivo es inglés, no se procesa la
#' traducción, por lo que se devuelve la documentación original del paquete.
#' @param topic El tema de búsqueda principal.
#' @param package Paquete R para buscar el tema.
#' @param lang Please provide the text you'd like me to translate.
#' @param type Utilice "html" o "texto" como salida para la ayuda, de lo
#' contrario se utilizará el valor por defecto de `getOption("help_type")`.
...
```
Editing an R script's Roxygen comments is a lot easier than editing an Rd file.
Additionally, this solution integrates better with the usual package development
process.

It also opens the possibility for collaborators to submit PRs to your package's
repository with edits to the translation, or even brand new translations.
### Include translations in your package
The Rd help files are still the best way for R to process and display your
help files. The second and final step is to have `lang` create the
Rd files based on the translated Roxygen comments. To do so, simply run:
```r
process_roxygen()
```
That function will iterate through all the language sub-folders in
**'man-lang/'** to process the Rd files. The resulting Rd files will be saved to
**'inst/man-lang/'**. Please keep in mind that this step does not need an LLM
to work. It is only creating the Rd files, and putting them in the correct
location.
Under the hood, `lang` creates temporary copies of your package, replaces the
scripts in the 'R' folder with your translations, and then runs the
`roxygen2::roxygenize()` function. This ensures that the Rd creation is as
close as possible to running `devtools::document()` during your
package development.
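
The staging step described above can be sketched with base R file operations.
This is not the code `lang` runs: `stage_translated_package()` is a hypothetical
name, and the sketch assumes that translated scripts overwrite the scripts with
the same file name while the remaining scripts are left in place:

```r
# Hypothetical helper, not part of lang: copies the package to a temporary
# location and overwrites its R scripts with the translated versions
stage_translated_package <- function(pkg_dir, lang_dir) {
  staging_root <- tempfile("lang-stage-")
  dir.create(staging_root)
  # Copies the whole package folder into the staging root
  file.copy(pkg_dir, staging_root, recursive = TRUE)
  staged_pkg <- file.path(staging_root, basename(pkg_dir))
  staged_r <- file.path(staged_pkg, "R")
  dir.create(staged_r, showWarnings = FALSE)
  # Translated scripts replace the originals that share their file name
  translated <- list.files(lang_dir, pattern = "\\.R$", full.names = TRUE)
  file.copy(translated, staged_r, overwrite = TRUE)
  staged_pkg
}

# Possible usage from the package root (not run here):
# staged <- stage_translated_package(".", "man-lang/es")
# roxygen2::roxygenize(staged)
```

Working on a copy keeps your real **'R/'** and **'man/'** folders untouched
while the translated Rd files are generated.
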
For this package, making that function call creates this console output:
```r
> process_roxygen()
ℹ Creating Rd files from man-lang/es (Spanish)
- ./inst/man-lang/es/help.Rd
- ./inst/man-lang/es/lang_help.Rd
- ./inst/man-lang/es/process_roxygen.Rd
- ./inst/man-lang/es/reexports.Rd
- ./inst/man-lang/es/to_iso639.Rd
- ./inst/man-lang/es/translate_roxygen.Rd
```
As an additional aid, `lang` will compare the Roxygen documentation in your
current **'R/'** folder, with the copy of the documentation made at the time
of translation. If there are differences, `lang` will show you a warning
indicating that a given translation may be out of date:
```r
> process_roxygen()
! The following R documentation has changed, translation may need to be revised:
|- R/translate-roxygen.R -x-> man-lang/es/translate-roxygen.R
ℹ Creating Rd files from man-lang/es (Spanish)
- ./inst/man-lang/es/help.Rd
- ./inst/man-lang/es/lang_help.Rd
- ./inst/man-lang/es/process_roxygen.Rd
- ./inst/man-lang/es/reexports.Rd
- ./inst/man-lang/es/to_iso639.Rd
- ./inst/man-lang/es/translate_roxygen.Rd
```
### Using your package's translations
The end-user can easily access your translations by making sure that `lang`
is loaded in their R session:
```r
library(lang)
Sys.setenv(LANGUAGE = "spanish")
?lang_help
```
`lang` always looks first in the **'inst/man-lang'** folder of your package
to see if there is a folder matching the end-user's language. If it does not
find one, it will then trigger a live translation of the function. This would be
the case if the user expects a French translation, but you only included a
Spanish one.
Instead of having the user wait for the LLM to complete the translation, if
`lang` finds a matching translation in your package, the help page will appear
almost instantly.
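
To picture that lookup: files under **'inst/'** are installed at the top level
of the package, so a packaged translation can be located with `system.file()`.
The sketch below is an assumption about how such a check could look, not the
code `lang` uses. `find_packaged_translation()` is a hypothetical name, and it
assumes the Rd file is named after the topic, as in the console output shown
earlier:

```r
# Hypothetical helper, not part of lang: returns the path of a packaged
# translation, or NULL when the package does not ship one for that language
find_packaged_translation <- function(topic, package, code) {
  rd_path <- system.file(
    "man-lang", code, paste0(topic, ".Rd"),
    package = package
  )
  # system.file() returns an empty string when nothing is found
  if (nzchar(rd_path)) rd_path else NULL
}

# `stats` does not ship a 'man-lang' folder, so the result here is NULL and
# a live translation would be the fallback
find_packaged_translation("lm", "stats", "es")
```
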
Under the hood, `lang` will use the value of your environment variables to
determine which sub-folder to check. If the value of `LANG` is a full locale
value (`en_US.UTF-8`), then it will check whether a folder matching the
variable's first two characters exists. If the value is not a locale, `lang`
will attempt to translate the value into an ISO 639 code. This package contains
a small conversion table to do its best to infer the language you are using, and
thus to know which sub-folder to look for.
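
The two cases above can be sketched as follows. This is a simplified
illustration and not the package's `to_iso639()` implementation:
`to_folder_code()` and the five-entry `language_table` are hypothetical, and the
real conversion table covers more languages:

```r
# Hypothetical, deliberately tiny conversion table
language_table <- c(
  english = "en", spanish = "es", french = "fr",
  german = "de", portuguese = "pt"
)

# Hypothetical helper: maps a locale or a language name to a two-letter code
to_folder_code <- function(value) {
  value <- tolower(value)
  # Full locale such as "fr_FR.UTF-8": keep the first two characters
  if (grepl("^[a-z]{2}_[a-z]{2}", value)) {
    return(substr(value, 1, 2))
  }
  # Language name such as "spanish": look it up in the table
  if (value %in% names(language_table)) {
    return(language_table[[value]])
  }
  # Already a two-letter code
  if (grepl("^[a-z]{2}$", value)) {
    return(value)
  }
  NA_character_
}

to_folder_code("fr_FR.UTF-8")
to_folder_code("spanish")
```

A value that matches none of the three cases returns `NA`, which in this sketch
would mean that no sub-folder can be inferred.
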