diff --git a/.gitignore b/.gitignore
index 6e2b54a..f325864 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,2 @@
-# Generated files
-dummy_git_config.txt
-git_history.json
-target/
-.git_commit_message.txt
-result
-result/
-submodule_collector_strace.log
-*~
\ No newline at end of file
+
+.emacs.d/
diff --git a/docs/crq_standardized/CRQ-059-wikipedia-wikidata-extractor.md b/docs/crq_standardized/CRQ-059-wikipedia-wikidata-extractor.md
new file mode 100644
index 0000000..4e66d93
--- /dev/null
+++ b/docs/crq_standardized/CRQ-059-wikipedia-wikidata-extractor.md
@@ -0,0 +1,32 @@
+**CRQ: CRQ-059: Implement Wikipedia and Wikidata Extractor**
+
+**Problem/Goal:**
+To expand our knowledge base by extracting content from Wikipedia articles and their associated Wikidata RDF facts. This data will be processed and stored in a structured file system cache for efficient retrieval and analysis. The extraction process needs to be robust, capable of parsing page content, extracting links, and associating data with version information from Wikipedia.
+
+**Proposed Solution:**
+
+1. **Create `wikipedia_extractor` Rust Crate:** (Already initiated) A new Rust crate named `wikipedia_extractor` will be developed to encapsulate all Wikipedia and Wikidata extraction logic.
+2. **Define Data Structures:**
+    * Implement Rust structs to represent Wikipedia article content (text, links, metadata) and Wikidata RDF facts.
+    * Consider using existing crates for RDF parsing if available and suitable.
+3. **Wikipedia Article Fetching:**
+    * Implement functionality to fetch Wikipedia article content given a URL or article title.
+    * Handle different content formats (e.g., HTML, MediaWiki API responses).
+4. **Wikidata RDF Fact Extraction:**
+    * Integrate with the Wikidata API to retrieve RDF facts associated with Wikipedia articles.
+    * Parse and store these facts in the defined data structures.
+5. **Link Extraction:**
+    * Develop logic to parse Wikipedia article HTML/content and extract internal and external links.
+6. **Structured File System Cache:**
+    * Implement a mechanism to store the extracted Wikipedia content, Wikidata facts, and links in a structured file system.
+    * Include versioning based on Wikipedia's revision IDs or timestamps to ensure data freshness.
+7. **Integration with URL Tool:**
+    * Modify the existing `url_extractor` or create a new "URL tool" that can identify Wikipedia URLs.
+    * When a Wikipedia URL is encountered, it will dispatch the URL to the `wikipedia_extractor` for processing.
+    * The `wikipedia_extractor` will then return the processed data to the calling tool or directly store it in the cache.
+8. **Tests (Extreme Programming Approach):**
+    * Develop comprehensive unit and integration tests for each component of the `wikipedia_extractor` crate, following an Extreme Programming (XP) "test-first" approach.
+    * Tests will cover fetching, parsing, extraction, and caching functionalities.
+
+**Justification/Impact:**
+This CRQ will significantly enhance our ability to gather and integrate structured knowledge from Wikipedia and Wikidata into our project's knowledge base. By automating this process, we can enrich our contextual understanding of various topics, improve data analysis capabilities, and support more intelligent agent behaviors. The structured caching with versioning will ensure data integrity and efficient access.
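
**Illustrative Sketches (non-normative):**

A minimal sketch of the data structures described in step 2 of the proposed solution. All type and field names here (`WikipediaArticle`, `WikidataFact`, and so on) are assumptions for illustration, not existing code; a dedicated RDF crate could replace the hand-rolled triple type if one proves suitable.

```rust
use std::collections::HashMap;

/// Hypothetical representation of a fetched Wikipedia article.
#[derive(Debug, Clone)]
pub struct WikipediaArticle {
    pub title: String,
    /// Revision ID reported by MediaWiki, reused later for cache versioning.
    pub revision_id: u64,
    /// Fetch timestamp, e.g. an RFC 3339 string.
    pub fetched_at: String,
    /// Article body (wikitext or rendered HTML, depending on the fetch path).
    pub text: String,
    pub internal_links: Vec<String>,
    pub external_links: Vec<String>,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical Wikidata statement kept as a plain subject/property/value triple.
#[derive(Debug, Clone)]
pub struct WikidataFact {
    pub entity_id: String,   // e.g. "Q42"
    pub property_id: String, // e.g. "P31"
    pub value: String,
}
```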
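For steps 3 and 4, a sketch of how request URLs for the MediaWiki Action API and Wikidata's `Special:EntityData` endpoint could be built. The query parameters shown are one plausible combination and proper percent-encoding of titles is omitted; the crate's actual requests, and the choice of HTTP client, remain open.

```rust
/// Build a MediaWiki Action API request returning the current revision ID and
/// wikitext for an article title (assumed parameter set; adjust as needed).
fn article_api_url(lang: &str, title: &str) -> String {
    format!("https://{lang}.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=ids%7Ccontent&rvslots=main&format=json&titles={title}")
}

/// Wikidata publishes per-entity data via Special:EntityData;
/// the .ttl suffix requests an RDF (Turtle) serialization.
fn wikidata_entity_url(entity_id: &str) -> String {
    format!("https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.ttl")
}

fn main() {
    println!("{}", article_api_url("en", "Rust_(programming_language)"));
    println!("{}", wikidata_entity_url("Q42")); // Q42: a well-known example entity
}
```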
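A sketch of the versioned file-system cache from step 6, assuming a layout of one directory per article and one subdirectory per revision ID; the layout and file names are illustrative only.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Assumed layout: <cache_root>/<sanitised_title>/<revision_id>/article.txt
/// One subdirectory per revision preserves older versions for comparison.
fn cache_dir(cache_root: &Path, title: &str, revision_id: u64) -> PathBuf {
    // Deliberately naive sanitisation; real Wikipedia titles need more care.
    let safe_title: String = title
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { '_' })
        .collect();
    cache_root.join(safe_title).join(revision_id.to_string())
}

/// Write the article body into its revision-specific cache directory.
fn store_article_text(
    cache_root: &Path,
    title: &str,
    revision_id: u64,
    text: &str,
) -> io::Result<PathBuf> {
    let dir = cache_dir(cache_root, title, revision_id);
    fs::create_dir_all(&dir)?;
    let path = dir.join("article.txt");
    fs::write(&path, text)?;
    Ok(path)
}
```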
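Finally, a sketch of the dispatch decision in step 7: the URL tool only needs to recognise Wikipedia hosts and hand matching URLs to `wikipedia_extractor`. The host check below is intentionally simplistic; a production version would parse URLs properly, for example with the `url` crate.

```rust
#[derive(Debug, PartialEq)]
enum UrlRoute {
    /// Hand off to the wikipedia_extractor crate.
    Wikipedia,
    /// Fall back to the generic URL handling path.
    Generic,
}

fn classify_url(url: &str) -> UrlRoute {
    // Strip the scheme and take everything up to the first '/' as the host.
    let host = url
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .split('/')
        .next()
        .unwrap_or("");
    if host == "wikipedia.org" || host.ends_with(".wikipedia.org") {
        UrlRoute::Wikipedia
    } else {
        UrlRoute::Generic
    }
}

fn main() {
    assert_eq!(
        classify_url("https://en.wikipedia.org/wiki/Rust_(programming_language)"),
        UrlRoute::Wikipedia
    );
    assert_eq!(classify_url("https://example.com/page"), UrlRoute::Generic);
    println!("routing checks passed");
}
```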