Commit

v1.0.0
Add parsing sitemaps feature. Also, the required PHP version is now 8.0.
otsch committed Sep 22, 2022
1 parent 504b5f0 commit 2d0d00e
Showing 13 changed files with 111 additions and 101 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -30,7 +30,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        php-versions: ['7.4', '8.0']
+        php-versions: ['7.4', '8.0', '8.1', '8.2']

     steps:
       - name: Checkout code
6 changes: 4 additions & 2 deletions .php-cs-fixer.dist.php
@@ -5,9 +5,11 @@

 $config = new PhpCsFixer\Config();

-return $config->setRules([
+return $config->setFinder($finder)
+    ->setRules([
         '@PSR12' => true,
         'strict_param' => true,
         'array_syntax' => ['syntax' => 'short'],
     ])
-    ->setFinder($finder);
+    ->setRiskyAllowed(true)
+    ->setUsingCache(true);
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [1.0.0] - 2022-09-22
+### Changed
+- Required PHP version is now 8.0.
+
+### Added
+- It now also parses `Sitemap:` lines. You can get all referenced sitemaps via the `sitemaps()` method of the `RobotsTxt` class.
+
 ## [0.1.2] - 2022-09-16
 ### Fixed
 - Also allow usage of crwlr/url 1.0 as it's not a problem at all and the PHP version requirement of this package is still `^7.4|^8.0`.
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -37,15 +37,15 @@ Linting can be executed using the `composer cs` command.
 When you're making changes to this package please always run
 unit tests, CS Fixer and PHPStan. Commands:
 `composer test`
-`composer cs`
+`composer cs` or `composer cs-fix`
 `composer stan`

 Ideally you add the pre-commit git hook that is shipped with
 this repo that will run tests and linting. Add it to your local
 clone by running:
 `composer add-git-hooks`

-Also please don't forget to add new test cases if necessary.
+Also, please don't forget to add new test cases if necessary.

 ### Documentation

@@ -56,5 +56,5 @@ For any code change please don't forget to add an entry to the
 ## Appreciation

 When your pull request is merged I will show some love and tweet
-about it. Also if you meet me in person I will be glad to buy you
+about it. Also, if you meet me in person I will be glad to buy you
 a beer.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
-Copyright (c) 2021 Christian Olear
+Copyright (c) 2022 Christian Olear

 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
35 changes: 6 additions & 29 deletions README.md
@@ -1,38 +1,15 @@
 <p align="center"><a href="https://www.crwlr.software" target="_blank"><img src="https://github.com/crwlrsoft/graphics/blob/eee6cf48ee491b538d11b9acd7ee71fbcdbe3a09/crwlr-logo.png" alt="crwlr.software logo" width="260"></a></p>

 # Robots Exclusion Standard/Protocol Parser
 ## for Web Crawling/Scraping

 Use this library within crawler/scraper programs to parse robots.txt
 files and check if your crawler user-agent is allowed to load certain
 paths.

-## Requirements
-
-Requires PHP version 7.4 or above.
-
-## Installation
-
-Install the latest version with:
-
-```sh
-composer require crwlr/robots-txt
-```
-
-## Usage
-
-```php
-use Crwlr\RobotsTxt\RobotsTxt;
-
-$robotsTxtContent = file_get_contents('https://www.crwlr.software/robots.txt');
-$robotsTxt = RobotsTxt::parse($robotsTxtContent);
-
-$robotsTxt->isAllowed('/packages', 'MyBotName');
-```
+## Documentation
+You can find the documentation at [crwlr.software](https://www.crwlr.software/packages/robots-txt/getting-started).
+
-You can also check with an absolute url.
-But attention: the library won't (/can't) check if the host of your
-absolute url is the same as the robots.txt file was on (because it
-doesn't know the host where it's on, you just give it the content).
+## Contributing
+
-```php
-$robotsTxt->isAllowed('https://www.crwlr.software/packages', 'MyBotName');
-```
+If you consider contributing something to this package, read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).
5 changes: 3 additions & 2 deletions composer.json
@@ -31,7 +31,7 @@
         "docs": "https://www.crwlr.software/packages/robots-txt"
     },
     "require": {
-        "php": "^7.4|^8.0",
+        "php": "^8.0",
         "crwlr/url": "^1.0|^2.0"
     },
     "require-dev": {
@@ -48,7 +48,8 @@
     },
     "scripts": {
         "test": "@php vendor/bin/phpunit",
-        "cs": "PHP_CS_FIXER_IGNORE_ENV=1 php vendor/bin/php-cs-fixer fix -v --diff --dry-run --allow-risky=yes",
+        "cs": "@php vendor/bin/php-cs-fixer fix -v --diff --dry-run",
+        "cs-fix": "@php vendor/bin/php-cs-fixer fix -v",
         "stan": "@php vendor/bin/phpstan analyse -c phpstan.neon",
         "add-git-hooks": "@php bin/add-git-hooks"
     }
32 changes: 18 additions & 14 deletions src/Parser.php
@@ -7,14 +7,12 @@
 final class Parser
 {
     /**
-     * @param string $robotsTxtContent
-     * @return RobotsTxt
      * @throws InvalidRobotsTxtFileException
      */
     public function parse(string $robotsTxtContent): RobotsTxt
     {
         $lines = explode("\n", $robotsTxtContent);
-        $userAgentGroups = [];
+        $userAgentGroups = $sitemaps = [];

         for ($lineNumber = 0; $lineNumber < count($lines); $lineNumber++) {
             $line = $this->getLine($lines, $lineNumber);
@@ -31,25 +29,25 @@ public function parse(string $robotsTxtContent): RobotsTxt

                 $this->addRuleToUserAgentGroup($line, $userAgentGroup);
             }
+
+            if ($this->isSitemapLine($line)) {
+                $sitemaps[] = $this->getSitemapFromLine($line);
+            }
         }

-        return new RobotsTxt($userAgentGroups);
+        return new RobotsTxt($userAgentGroups, $sitemaps);
     }

     /**
      * @param string[] $lines
-     * @param int $lineNumber
-     * @return string
      */
     private function getLine(array $lines, int $lineNumber): string
     {
         return trim($lines[$lineNumber]);
     }

     /**
-     * @param array|string[] $lines
-     * @param int $lineNumber
-     * @return string|null
+     * @param string[] $lines
      */
     private function getNextLine(array $lines, int $lineNumber): ?string
     {
@@ -70,6 +68,11 @@ private function isRuleLine(string $line): bool
         return $this->isDisallowLine($line) || $this->isAllowLine($line);
     }

+    private function isSitemapLine(string $line): bool
+    {
+        return preg_match('/^\s?sitemap\s?:/i', $line) === 1;
+    }
+
     private function isDisallowLine(string $line): bool
     {
         return preg_match('/^\s?disallow\s?:/i', $line) === 1;
@@ -81,7 +84,7 @@ private function isAllowLine(string $line): bool
     }

     /**
-     * @param array|string[] $lines
+     * @param string[] $lines
      */
     private function makeUserAgentGroup(array $lines, string $line, int &$lineNumber): UserAgentGroup
     {
@@ -112,15 +115,16 @@ private function addRuleToUserAgentGroup(string $line, UserAgentGroup $userAgent
         }
     }

-    /**
-     * @param string $line
-     * @return string
-     */
     private function getUserAgentFromLine(string $line): string
     {
         return $this->getStringAfterFirstColon($line);
     }

+    private function getSitemapFromLine(string $line): string
+    {
+        return $this->getStringAfterFirstColon($line);
+    }
+
     private function getPatternFromRuleLine(string $line): string
     {
         $lineAfterFirstColon = $this->getStringAfterFirstColon($line);
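
Note on the `Parser` change: the sitemap handling can be tried in isolation. The following is a minimal standalone sketch, not the package's code. `extractSitemaps()` is a hypothetical helper; the regular expression is copied from `isSitemapLine()` above, while trimming everything after the first colon is an assumption, because `getStringAfterFirstColon()` is not part of this diff.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: collects the values of all `Sitemap:` lines.
 *
 * @return string[]
 */
function extractSitemaps(string $robotsTxtContent): array
{
    $sitemaps = [];

    foreach (explode("\n", $robotsTxtContent) as $rawLine) {
        // Same normalization as Parser::getLine(): trim each line first.
        $line = trim($rawLine);

        // Same pattern as Parser::isSitemapLine() in this commit.
        if (preg_match('/^\s?sitemap\s?:/i', $line) === 1) {
            // Assumption: keep everything after the first colon, trimmed.
            $sitemaps[] = trim(substr($line, strpos($line, ':') + 1));
        }
    }

    return $sitemaps;
}

$content = <<<TXT
User-agent: *
Disallow: /private

Sitemap: https://www.example.com/sitemap.xml
sitemap : https://www.example.com/sitemap-news.xml
TXT;

print_r(extractSitemaps($content));
```

Only the first colon is used as the separator, so the colon inside `https://` stays part of the value. Because each line is trimmed before matching, the optional `\s?` at the start of the pattern has no effect in this sketch.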
30 changes: 18 additions & 12 deletions src/RobotsTxt.php
@@ -3,19 +3,16 @@
 namespace Crwlr\RobotsTxt;

 use Crwlr\RobotsTxt\Exceptions\InvalidRobotsTxtFileException;
+use Exception;
 use InvalidArgumentException;

 final class RobotsTxt
 {
     /**
-     * @var array|UserAgentGroup[]
+     * @param UserAgentGroup[] $userAgentGroups
+     * @param string[] $sitemaps
      */
-    private array $userAgentGroups = [];
-
-    /**
-     * @param array|UserAgentGroup[] $userAgentGroups
-     */
-    public function __construct(array $userAgentGroups)
+    public function __construct(private array $userAgentGroups, private array $sitemaps = [])
     {
         foreach ($userAgentGroups as $userAgentGroup) {
             if (!$userAgentGroup instanceof UserAgentGroup) {
@@ -24,8 +21,6 @@ public function __construct(array $userAgentGroups)
                 );
             }
         }
-
-        $this->userAgentGroups = $userAgentGroups;
     }

     /**
@@ -37,13 +32,24 @@ public static function parse(string $robotsTxtContent): RobotsTxt
     }

     /**
-     * @return array|UserAgentGroup[]
+     * @return UserAgentGroup[]
      */
     public function groups(): array
     {
         return $this->userAgentGroups;
     }

+    /**
+     * @return string[]
+     */
+    public function sitemaps(): array
+    {
+        return $this->sitemaps;
+    }
+
+    /**
+     * @throws Exception
+     */
     public function isAllowed(string $uri, string $userAgent): bool
     {
         $matchingGroups = $this->getGroupsMatchingUserAgent($userAgent);
@@ -61,7 +67,7 @@ public function isAllowed(string $uri, string $userAgent): bool
     /**
      * Find all groups that match a certain user agent string.
      *
-     * @return array|UserAgentGroup[]
+     * @return UserAgentGroup[]
      */
     private function getGroupsMatchingUserAgent(string $userAgent): array
     {
@@ -77,7 +83,7 @@ private function getGroupsMatchingUserAgent(string $userAgent): array
     }

     /**
-     * @param array|UserAgentGroup[] $groups
+     * @param UserAgentGroup[] $groups
      */
     private function combineGroups(array $groups): UserAgentGroup
     {
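
Note on the `RobotsTxt` change: the constructor now uses PHP 8.0 constructor property promotion, which is why the explicit property declaration and the `$this->userAgentGroups = ...` assignment could be dropped. A minimal sketch of the same pattern follows; `Group` and `Robots` are hypothetical stand-ins and not classes from this package.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for Crwlr\RobotsTxt\UserAgentGroup.
final class Group
{
}

// Hypothetical stand-in for Crwlr\RobotsTxt\RobotsTxt.
final class Robots
{
    /**
     * @param Group[] $groups
     * @param string[] $sitemaps
     */
    public function __construct(private array $groups, private array $sitemaps = [])
    {
        // Promoted parameters are assigned to their properties before the
        // body runs, so only the element validation remains here.
        foreach ($groups as $group) {
            if (!$group instanceof Group) {
                throw new InvalidArgumentException('Expected only Group instances.');
            }
        }
    }

    /** @return Group[] */
    public function groups(): array
    {
        return $this->groups;
    }

    /** @return string[] */
    public function sitemaps(): array
    {
        return $this->sitemaps;
    }
}

$robots = new Robots([new Group()], ['https://www.example.com/sitemap.xml']);
var_dump($robots->sitemaps());
```

The default value `[]` for the second parameter keeps existing callers that pass only the groups working, which matches how the new `$sitemaps` parameter is declared in the diff.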
10 changes: 3 additions & 7 deletions src/RulePattern.php
@@ -3,7 +3,7 @@
 namespace Crwlr\RobotsTxt;

 use Crwlr\Url\Url;
-use InvalidArgumentException;
+use Exception;

 final class RulePattern
 {
@@ -21,14 +21,10 @@ public function pattern(): string
     }

     /**
-     * @param string|Url|mixed $uri
+     * @throws Exception
      */
-    public function matches($uri): bool
+    public function matches(string|Url $uri): bool
     {
-        if (!$uri instanceof Url && !is_string($uri)) {
-            throw new InvalidArgumentException('Argument $uri must be a string or instance of Crwlr\Url.');
-        }
-
         $path = $uri instanceof Url ? $uri->path() : Url::parse($uri)->path();

         if (!is_string($path)) {
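
Note on the `RulePattern` change: the manual `InvalidArgumentException` check is replaced by a PHP 8.0 union type, so the engine rejects unsupported arguments itself. A minimal sketch of that behavior follows. `SimpleUrl` and `pathOf()` are hypothetical and only stand in for `Crwlr\Url\Url` and `matches()`; the built-in `parse_url()` is used here in place of `Url::parse()`.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for Crwlr\Url\Url, reduced to what the sketch needs.
final class SimpleUrl
{
    public function __construct(private string $path)
    {
    }

    public function path(): string
    {
        return $this->path;
    }
}

// The union type replaces a runtime "string or Url" check.
function pathOf(string|SimpleUrl $uri): string
{
    return $uri instanceof SimpleUrl
        ? $uri->path()
        : (string) parse_url($uri, PHP_URL_PATH);
}

var_dump(pathOf('https://www.example.com/packages'));
var_dump(pathOf(new SimpleUrl('/packages')));

try {
    // With strict_types enabled in the calling file, an int is not accepted.
    pathOf(123);
} catch (TypeError $error) {
    echo 'TypeError: argument rejected by the type declaration', PHP_EOL;
}
```

One difference from the removed check is worth keeping in mind: callers that do not declare `strict_types=1` get scalar coercion, so an integer would be converted to a string instead of raising an error.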