
Commit

Added JSON/XML parsers and revised Dynamic parser (#88)
* Added json and xml parsers

* New dynamic parser

* Changed to v6.0.0

* Updated config.loader docs

* Fixed function config assign

* axios builder will be called before request

* Accidental renames

* Removed .js from imports

* Added unpairedTags
JexSrs authored Oct 23, 2023
1 parent f798c3f commit db04d28
Showing 55 changed files with 7,486 additions and 3,823 deletions.
34 changes: 18 additions & 16 deletions README.md
@@ -11,6 +11,7 @@
- [WordPress V2](#wordpress-v2)
- [RSS](#rss)
- [HTML](#html)
- [JSON / XML](#json--xml)
- [Dynamic](#dynamic)
- [Which to choose](#which-to-choose)
- [Article](#article)
@@ -81,7 +82,7 @@ Read the [configuration](./docs/configuration.md) file for more information.
## Parsers

To retrieve the desired information from websites, we use parsers.
There are five available parser types: `wordpress`, `rss`, `html`, `json`/`xml` and `dynamic`.

### WordPress V2

@@ -99,10 +100,16 @@ then you can safely use the `wordpress` parser.

Parser type: `rss`

Many websites support [`RSS`](https://en.wikipedia.org/wiki/RSS) feed. RSS allows users and applications to access updates
to websites in a standardized, computer-readable format. You can check if a website supports RSS if you can see this
icon <img src="docs/rss.png" width="15" height="15" />.

### JSON / XML

Parser type: `json` (or `xml`)

This parser is best suited to pages that load their data through API requests (e.g. lazy loading).
The only prerequisite for this parser is that the response of the API requests is in a structured JSON or XML format.
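For illustration, a lazy-loading endpoint of this kind might return a structured payload like the following (the field names here are invented for the example):

```json
{
  "articles": [
    {
      "title": "Example article",
      "link": "https://example.com/article-1",
      "published": "2023-10-23T00:00:00Z"
    }
  ]
}
```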

### HTML

Expand All @@ -116,17 +123,16 @@ and CSS are not structured will be very difficult to scrape.

Parser type: `dynamic`

Unlike the other parsers, this parser uses JavaScript/TypeScript code to parse a website. All the logic for the scraping is
decided by the user by extending the `DynamicSourceFile` class.

### Which to choose

We recommend a specific order for using the available parsers.

* If the desired website is based on [`WordPress`](https://wordpress.com/) and the WordPress articles API is enabled, then choose the `wordpress-v2` parser.
* If the desired website supports an [`RSS`](https://en.wikipedia.org/wiki/RSS) feed, then choose the `rss` parser.
* If the desired website is loading data using API requests with structured responses (e.g. lazy loading), then choose the `json` or `xml` parser.
* If the desired website has a structured form, then use the `html` parser.
* If none of the above is possible (bad HTML or a custom API), then the `dynamic` parser is our last choice.

Expand All @@ -145,13 +151,8 @@ These files are generated from the user and guide Saffron on how to parse a webs

### Creating a source file

Read the [source](./docs/source_files/source_file.md) file for the common options, or the parser files
[WordPress V2](./docs/source_files/wordpress_v2.md), [RSS](./docs/source_files/rss.md), [API](./docs/source_files/json.md), [HTML](./docs/source_files/html.md) or [Dynamic](./docs/source_files/dynamic.md) for the scrape options.

## Middleware

@@ -222,6 +223,7 @@ try {
const result = Saffron.parse({
    name: "source-name",
    url: ["Category 1", "https://example.com"],
    type: "html",
    // ...
    scrape: {
        // ...
32 changes: 16 additions & 16 deletions docs/configuration.md
@@ -34,32 +34,26 @@ Default value: `true`

If `true`, Saffron will scan all the subdirectories inside the `path` directory.

### `dynamicSourceFiles`
An array containing all the implementations for the dynamic source files.

### `loader`
Default value: `JSON.parse(fs.readFileSync(filepath))`

A custom loader that allows each source file to be loaded into Saffron manually.

It was originally created to allow ES6 projects to load JavaScript files for the old dynamic parser.
Now it can be used to preprocess files before passing them to Saffron.

```typescript
loader: async (filepath: string) => {
    let data = JSON.parse(fs.readFileSync(filepath, 'utf-8'));
    // Process the content of the source file
    // ...
    return data;
}
```
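As an illustration, the loader can also preprocess the object before returning it. The sketch below reads a JSON source file and fills in a default `type` field; the field name and its default value are assumptions made for this example:

```typescript
import * as fs from 'fs';

// Hypothetical preprocessing loader: reads a JSON source file and
// injects a default "type" field before handing the object to Saffron.
const loader = async (filepath: string) => {
    const data = JSON.parse(fs.readFileSync(filepath, 'utf-8'));
    if (!data.type) {
        data.type = 'html'; // assumed default, adjust as needed
    }
    return data;
};
```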


### `includeOnly`
Default value: `[]`

@@ -104,11 +98,17 @@ between the requests that belong to the same source file.

### `axios`
Axios' configuration that will be applied to the requests made by Saffron.
The request method defaults to `GET`, but it can be overridden here.

It also supports an asynchronous callback like:

```typescript
axios: async (source: Source) => {
    return {
        method: 'POST',
        data: {
            key: 'value'
        },
        timeout: 3000
    };
}
```
Binary file added docs/rss.png
144 changes: 67 additions & 77 deletions docs/source_files/dynamic.md
@@ -1,45 +1,65 @@
# Dynamic parser

Unlike the others, the `dynamic` parser requires more configuration. We have
to initialize a `DynamicSourceFile` class and pass it to the configuration through the
[`dynamicSourceFiles`](../configuration.md#dynamicsourcefiles) option.

## DynamicSourceFile class
First we are going to create a class that extends the `DynamicSourceFile` class:
```ts
class Custom extends DynamicSourceFile {
// ...
}
```
After that we are going to implement the `name` method. This method must
return a unique string that will help Saffron identify the implementation.
```ts
class Custom extends DynamicSourceFile {
name(): string {
return "dynamic-1";
}
// ...
}
```
Next, we will implement the `request` method, which is responsible for performing
all the network requests. The result should include one or more objects of
type `AxiosResponse`.

In cases where a login to the remote website is required, it can be done from here.

```ts
class Custom extends DynamicSourceFile {
    // ...
    request(utils: Utils): Promise<RequestsResult> {
        // Request using utils.get to apply the axios config
        // passed in the global and/or source configurations
        return utils.get(utils.url);
    }
    // ...
}
```
Lastly, we are going to implement the `parse` method, which is responsible for
all the parsing. It receives the responses from the `request` method
and must return an array of `Article` objects.

```ts
class Custom extends DynamicSourceFile {
    // ...
    async parse(result: RequestsResult, utils: Utils): Promise<Article[]> {
        const articles: Article[] = [];
        // ...

        if (error)
            throw new Error('Failed for source file [name]!');

        // Return an array of all the articles you want to be added
        return articles;
    }
}
```
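Putting the pieces together, the three methods can be combined into a single implementation. The sketch below also declares minimal stand-ins for Saffron's types so that it is self-contained; in a real project `DynamicSourceFile`, `Utils`, `RequestsResult` and `Article` are provided by Saffron itself:

```typescript
// Minimal stand-ins for Saffron's types, included only so this
// sketch is self-contained.
type RequestsResult = unknown;
interface Utils {
    url: string;
    get(url: string): Promise<RequestsResult>;
}
class Article {
    title = '';
}
abstract class DynamicSourceFile {
    abstract name(): string;
    abstract request(utils: Utils): Promise<RequestsResult>;
    abstract parse(result: RequestsResult, utils: Utils): Promise<Article[]>;
}

class Custom extends DynamicSourceFile {
    // Unique identifier Saffron uses to match the source file
    // to this implementation.
    name(): string {
        return "dynamic-1";
    }

    // Perform the network requests; utils.get applies the axios
    // configuration from the global and/or source configuration.
    request(utils: Utils): Promise<RequestsResult> {
        return utils.get(utils.url);
    }

    // Map the responses to Article objects.
    async parse(result: RequestsResult, utils: Utils): Promise<Article[]> {
        const articles: Article[] = [];
        // ... build articles from the response data ...
        return articles;
    }
}
```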

## Scrape

### `implementation`
The name of the implementation we have configured (the string returned by the `name` method).

## Utils
Utils provide a set of necessary functions and fields that are used by all scrapers.

### `isScrapeAfterError`
@@ -98,48 +118,23 @@ it will also remove any HTML tags.
It will accept HTML as string and extract the text and links of the following tags:
`a`, `img` and `link`.


## Article

Saffron offers the Article class to construct an Article object.

## Writing code

### Callbacks
The dynamic parser does not support callback responses. In case the use of callbacks cannot be avoided,
you can return a promise:

```javascript
return new Promise((resolve, reject) => {
    request(utils.url, (response, error) => {
        if (error != null) {
            return reject(error);
        }

        // ...
        resolve(response);
    });
});
```

### Fail job
@@ -148,17 +143,12 @@ If you want to mark the current source scraping job as a failure and return no
articles then you have to throw an `Error`:

```js
throw new Error("Parsing failed.");
```
or reject the promise:
```js
return new Promise((resolve, reject) => {
    // ...
    reject(new Error("Parsing failed."));
});
```
4 changes: 3 additions & 1 deletion docs/source_files/html.md
@@ -1,6 +1,6 @@
# HTML parser

The `json`, `xml` and `html` parsers have the most difficult source files after the `dynamic` parser.
It is a JSON file, and it can NOT run just by providing a URL; a lot of configuration is needed.

You will need to use tools such as the browser's **inspect element tool** to find
@@ -166,6 +166,8 @@ will have the following options:
}
```

The extra options `title-second` and `title-third` will also be added to the `extras` field of the article.

### `static`
Assign a static string to the specified field. All the other options are omitted.

