arXiv fetch is unreliable #236

@TimidRobot

Description

The arXiv fetch script pulls license information from two unreliable fields:

def extract_license_info(entry):
    """
    Extract CC license information from ArXiv paper entry.
    Checks rights field first, then summary field for license patterns.
    Returns normalized license identifier or "Unknown".
    """
    # checking through the rights field first then summary
    if hasattr(entry, "rights") and entry.rights:
        license_info = normalize_license_text(entry.rights)
        if license_info != "Unknown":
            return license_info
    if hasattr(entry, "summary") and entry.summary:
        license_info = normalize_license_text(entry.summary)
        if license_info != "Unknown":
            return license_info
    return "Unknown"

  1. rights
  2. summary
    • Relies on a text match. However, the summary may mention any number of legal tools without the paper itself being licensed under any of them.

Reproduction

Add a logging statement to the extract_license_info() function, or examine the schema.

For example, a result for the query all:"CC BY" includes the following summary for [2008.00774v3] Elsevier OA CC-By Corpus:

We introduce the Elsevier OA CC-BY corpus. This is the first open corpus of
Scientific Research papers which has a representative sample from across
scientific disciplines. This corpus not only includes the full text of the
article, but also the metadata of the documents, along with the bibliographic
information for each reference.

Here we can see the paper is about CC BY works, but the paper itself is licensed under the arXiv.org - Non-exclusive license to distribute, not CC BY.
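The mismatch can be reproduced offline with a minimal sketch. It assumes a simplified stand-in for normalize_license_text() (the project's real helper is not shown in this issue) and uses a hypothetical extract_license_info_logged() that follows the same rights-then-summary order while logging which field matched. The entry is built by hand from the summary quoted above rather than fetched from the API.

```python
import logging
import re
from types import SimpleNamespace

logger = logging.getLogger("arxiv_fetch_debug")  # hypothetical logger name


def normalize_license_text(text):
    """Simplified stand-in for the project's helper (assumption, not the real code)."""
    match = re.search(r"\bCC[ -]?(BY(?:-(?:NC|SA|ND))*)\b", text, re.IGNORECASE)
    if match:
        return "CC " + match.group(1).upper()
    return "Unknown"


def extract_license_info_logged(entry):
    """Same lookup order as extract_license_info(), plus a log line per match."""
    for field in ("rights", "summary"):
        value = getattr(entry, field, None)
        if value:
            license_info = normalize_license_text(value)
            if license_info != "Unknown":
                logger.warning(
                    "license %r matched in %r field", license_info, field
                )
                return license_info
    return "Unknown"


# Hand-built entry: no rights field, summary taken from 2008.00774v3.
entry = SimpleNamespace(
    summary=(
        "We introduce the Elsevier OA CC-BY corpus. This is the first open "
        "corpus of Scientific Research papers which has a representative "
        "sample from across scientific disciplines."
    )
)

# The text match reports a CC BY license from the summary alone,
# even though the paper itself is not CC BY licensed.
result = extract_license_info_logged(entry)
```

With this stand-in, the log line shows that the match came from the summary field, which is the false positive described above.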

Expectation

License information should be reliable.

Near as I can tell, the API does not provide licensing information. Licensing details would have to be scraped from each article's web page. Unfortunately, the structured RDF data on each article's web page doesn't include licensing information.

Web scraping is not currently in scope for this project.

Additional context

Resolution

  • I would be interested in resolving this bug.
