arXiv fetch is unreliable

## Description
The arXiv fetch script pulls license information from two unreliable fields:
https://github.com/creativecommons/quantifying/blob/6482d632849d13d4b7da10d084f54c5a26619d3f/scripts/1-fetch/arxiv_fetch.py#L353-L369
1. rights
   - ‼️ Does not exist. There is no `<rights>` entry in [5.2. Details of Atom Results Returned](https://info.arxiv.org/help/api/user-manual.html#52-details-of-atom-results-returned)
2. summary
   - Relies on a text match. However, the summary may talk about any number of legal tools without the paper itself using any of those legal tools.

## Reproduction
Add a logging statement to the `extract_license_info()` function, or examine the schema of the Atom entries the API returns.
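One way to examine the schema without touching the fetch script is to list the element names present in each Atom entry and log whether `rights` is among them. The following is a minimal sketch using only the standard library rather than the script's own parser. `SAMPLE_FEED` is a hand-written, abbreviated stand-in for a real API response, and `entry_field_names` is a hypothetical helper, not something from the repository.

```python
import logging
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger("arxiv_fetch_debug")

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hand-written, abbreviated sample entry (assumption): a real API response
# carries more elements (authors, dates, links, categories).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2008.00774v3</id>
    <title>Elsevier OA CC-By Corpus</title>
    <summary>We introduce the Elsevier OA CC-BY corpus.</summary>
  </entry>
</feed>
"""


def entry_field_names(feed_xml):
    """Return, per entry, the sorted local names of its child elements."""
    root = ET.fromstring(feed_xml)
    names_per_entry = []
    for entry in root.iter(f"{ATOM_NS}entry"):
        # Strip the "{namespace}" prefix that ElementTree adds to tags.
        names = sorted({child.tag.split("}", 1)[-1] for child in entry})
        LOGGER.info(
            "entry fields: %s (rights present: %s)", names, "rights" in names
        )
        names_per_entry.append(names)
    return names_per_entry


fields = entry_field_names(SAMPLE_FEED)
```

To check real data, pass the body of an actual API response to `entry_field_names()` in place of `SAMPLE_FEED`.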

For example, a result for the query `[ALL:](all:"CC BY")` includes the following summary for [[2008.00774v3] Elsevier OA CC-By Corpus](http://arxiv.org/abs/2008.00774v3):
>  We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of
Scientific Research papers which has a representative sample from across
scientific disciplines. This corpus not only includes the full text of the
article, but also the metadata of the documents, along with the bibliographic
information for each reference.

Here we can see the paper is **_about CC BY works_**, but the paper itself is licensed under the [arXiv.org - Non-exclusive license to distribute](https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html), not CC BY.
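The false positive can be sketched in a few lines. `naive_license_match` and `CC_PATTERN` below are hypothetical stand-ins for the script's text matching. The real `normalize_license_text()` is not shown in this issue and may behave differently, but any matcher that searches the summary text is open to the same problem.

```python
import re

# Hypothetical stand-in for the script's text matching (assumption): the real
# normalize_license_text() is not shown in this issue and may differ.
CC_PATTERN = re.compile(r"\bCC[ -]BY(?:[ -](?:NC|ND|SA))*\b", re.IGNORECASE)


def naive_license_match(text):
    """Return the first CC-looking token in text, or "Unknown"."""
    match = CC_PATTERN.search(text)
    if match is None:
        return "Unknown"
    # Normalize "CC-BY" / "cc by" style spellings to "CC BY".
    return match.group(0).upper().replace("-", " ")


# Opening sentences of the summary quoted above.
SUMMARY = (
    "We introduce the Elsevier OA CC-BY corpus. This is the first open corpus "
    "of Scientific Research papers which has a representative sample from "
    "across scientific disciplines."
)

detected = naive_license_match(SUMMARY)
```

Here `detected` is `"CC BY"`, even though the summary only names the corpus and says nothing about the paper's own license.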

## Expectation
License information should be reliable.

Near as I can tell, the API does not provide licensing information. Licensing details would have to be scraped from each article's web page. Unfortunately, the structured RDF data on each article's web page doesn't include licensing information.

Web scraping is not currently in scope for this project.





## Additional context
- [License and copyright - arXiv info](https://info.arxiv.org/help/license/index.html)

## Resolution

- [ ] I would be interested in resolving this bug.

Referenced code (`scripts/1-fetch/arxiv_fetch.py`, lines 353-369 at the commit linked in the Description):

	def extract_license_info(entry):
	    """
	    Extract CC license information from ArXiv paper entry.

	    Checks rights field first, then summary field for license patterns.
	    Returns normalized license identifier or "Unknown".
	    """
	    # checking through the rights field first then summary
	    if hasattr(entry, "rights") and entry.rights:
	        license_info = normalize_license_text(entry.rights)
	        if license_info != "Unknown":
	            return license_info
	    if hasattr(entry, "summary") and entry.summary:
	        license_info = normalize_license_text(entry.summary)
	        if license_info != "Unknown":
	            return license_info
	    return "Unknown"

arXiv fetch is unreliable #236
