Description
The arXiv fetch script pulls license information from two unreliable fields:
quantifying/scripts/1-fetch/arxiv_fetch.py
Lines 353 to 369 in 6482d63
    def extract_license_info(entry):
        """
        Extract CC license information from ArXiv paper entry.
        Checks rights field first, then summary field for license patterns.
        Returns normalized license identifier or "Unknown".
        """
        # checking through the rights field first then summary
        if hasattr(entry, "rights") and entry.rights:
            license_info = normalize_license_text(entry.rights)
            if license_info != "Unknown":
                return license_info
        if hasattr(entry, "summary") and entry.summary:
            license_info = normalize_license_text(entry.summary)
            if license_info != "Unknown":
                return license_info
        return "Unknown"
- `rights`
  - ‼️ Does not exist. There is no `<rights>` entry in "5.2. Details of Atom Results Returned".
- `summary`
  - Relies on a text match. However, the summary may discuss any number of legal tools without the paper itself being licensed under any of them.
Reproduction
Add a logging statement to the `extract_license_info()` function, or examine the schema.
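A minimal sketch of such logging follows. It keeps the same field order as `extract_license_info()` but also reports which field produced the match. The names `extract_license_info_logged` and `CC_BY_PATTERN` are hypothetical, `normalize_license_text` here is a simplified stand-in rather than the project's function, and the entry is a `SimpleNamespace` imitating a parsed feed entry rather than a real API result.

```python
import logging
import re
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger("arxiv_fetch_debug")

# Simplified stand-in for the project's normalize_license_text() (assumption):
# it only recognizes "CC BY" / "CC-BY" spellings.
CC_BY_PATTERN = re.compile(r"\bCC[\s-]?BY\b", re.IGNORECASE)


def normalize_license_text(text):
    return "CC BY" if CC_BY_PATTERN.search(text) else "Unknown"


def extract_license_info_logged(entry):
    """Check rights, then summary, and log which field (if any) matched."""
    for field in ("rights", "summary"):
        value = getattr(entry, field, None)
        if not value:
            LOGGER.info("field %r is missing or empty", field)
            continue
        license_info = normalize_license_text(value)
        LOGGER.info("field %r -> %s", field, license_info)
        if license_info != "Unknown":
            return license_info, field
    return "Unknown", None


# Hypothetical entry with no rights attribute, only a summary.
entry = SimpleNamespace(summary="We introduce the Elsevier OA CC-BY corpus.")
result = extract_license_info_logged(entry)
print(result)
```

Unlike the original function, this sketch returns a `(license, field)` pair so the source of a match is visible to the caller as well as in the log.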
For example, a result for the query `all:"CC BY"` includes the following summary for [2008.00774v3] Elsevier OA CC-By Corpus:
We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of
Scientific Research papers which has a representative sample from across
scientific disciplines. This corpus not only includes the full text of the
article, but also the metadata of the documents, along with the bibliographic
information for each reference.
Here we can see the paper is about CC BY works, but the paper itself is licensed under "arXiv.org - Non-exclusive license to distribute", not CC BY.
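The mismatch can be illustrated with a small sketch. The pattern below is an assumed approximation of a text-based license match, not the project's actual `normalize_license_text()` logic; the summary and license strings are the ones quoted in this issue.

```python
import re

# Assumed approximation of a text-based CC BY match (not the project's pattern).
CC_BY_PATTERN = re.compile(r"\bCC[\s-]?BY\b", re.IGNORECASE)

summary = (
    "We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of "
    "Scientific Research papers which has a representative sample from across "
    "scientific disciplines."
)
actual_license = "arXiv.org - Non-exclusive license to distribute"

# The summary mentions CC BY, so a text match fires on it ...
match_in_summary = CC_BY_PATTERN.search(summary)
# ... even though the paper's actual license contains no CC BY reference.
match_in_license = CC_BY_PATTERN.search(actual_license)

print("summary match:", match_in_summary.group(0) if match_in_summary else None)
print("license match:", match_in_license.group(0) if match_in_license else None)
```

Any matcher that inspects only the summary will classify this paper as CC BY, which is a false positive.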
Expectation
License information should be reliable.
As near as I can tell, the API does not provide licensing information. Licensing details would have to be scraped from each article's web page. Unfortunately, the structured RDF data on each article's web page doesn't include licensing information either.
Web scraping is not currently in scope for this project.
Additional context
Resolution
- I would be interested in resolving this bug.