Conversation
tests/test_metadata_utils.py
Outdated
| PROCESSORS["html"] = HtmlProcessor | ||
|
|
||
| text1, mask1 = add_local_metadata_to_text(self.examples[3], cfg) | ||
| target_text = '<a>useless text </a><div><a><div><div></div></div></a></div><h1><i>The Walking Dead</i> (season 8)</h1>\n' |
There was a problem hiding this comment.
Before the change I propose in this PR the result was here:
<a>useless text </div></a></div></a></div><h1><i><div><a><div><div>The Walking Dead</i> (season 8)</h1>
Just curious, is this (same index) a limitation of the source dataset? |
No no, I compile the dataset myself from Natural Questions and put it in the format proposed here. |
Ah, I see. I misunderstood. Thank you for the clarification! |
🙂
On the parser side I think it's really feasible, what's more blocking in my opinion is where to store the information in the |
I see. Funny you should mention that, because my first comment was gonna be "having a new property or even structure of jsonl for it" but then I didn't send it for exactly the same reason. lol That being said, I wonder if entities will also need that kind of nested construct. |
|
Hi @SaulLu, as discussed yesterday, I fully agree that this is an important issue that needs to be fixed. However, I'm not sure I fully understand how your proposed solution works - would you perhaps have 15 minutes or so sometime tomorrow (ideally, between 15:00 and 17:00 CEST) to go through this pull request with me? :) |
|
Summary of an offline discussion: unfortunately this PR does not solve all the concerns about the order of the local HTML metadata to be added. For example, @timoschick has rightly shown that Another pathological case is: I'm converting this PR into a draft, for now, the time to think about another workaround. |
I propose in this PR to change the way in which local tags are added to the text.
Motivation
HTML defines a tree. In this sense, it is important that the rules for opening and closing tags are respected. For example, the following html is not valid
"<h1> test </h1></a><a>". With the current way of adding local tags, this rule may not be respected because of self closing tags which can have their opening and closing tags on the same character index.As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.
Implementation details
Review
I would be very happy to have your opinion on this PR