
Fix HTML title parsing bugs.

This drops the leading "." from the capture group in HTML_TITLE_REGEX
to fix two parsing errors. The first occurred when the title tag was
empty (e.g. "<title></title>"): the "." consumed the "<" of the closing
tag, so the title was parsed as "</title". The second occurred when the
title was a single character (e.g. "<title>A</title>"): the group
required at least two characters, so the regex did not match and the
title fell back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
Ben Muthalaly 2 years ago
parent
commit
77917e9b55

+ 1 - 1
archivebox/extractors/title.py

@@ -26,7 +26,7 @@ from ..logging_util import TimedProgress
 
 HTML_TITLE_REGEX = re.compile(
     r'<title.*?>'                      # start matching text after <title> tag
-    r'(.[^<>]+)',                      # get everything up to these symbols
+    r'([^<>]+)',                      # get everything up to these symbols
     re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE,
 )
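
The two cases described in the commit message can be checked with a short standalone sketch. It copies only the regex from the diff above. The names OLD_TITLE_REGEX, NEW_TITLE_REGEX and extract_title are hypothetical helpers introduced for illustration and are not part of archivebox. The fallback to link.base_url is not modelled: a return value of None stands in for "no match, so the caller would fall back".

```python
import re

# Same flags as HTML_TITLE_REGEX in the diff above.
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Hypothetical names: the pattern before and after this commit.
OLD_TITLE_REGEX = re.compile(r'<title.*?>' r'(.[^<>]+)', FLAGS)
NEW_TITLE_REGEX = re.compile(r'<title.*?>' r'([^<>]+)', FLAGS)


def extract_title(regex, html):
    """Return the first captured title, or None when the regex does not match."""
    match = regex.search(html)
    return match.group(1) if match else None


samples = [
    '<title></title>',                # empty title tag
    '<title>A</title>',               # single-character title
    '<title>Example Domain</title>',  # ordinary title
]

for html in samples:
    old = extract_title(OLD_TITLE_REGEX, html)
    new = extract_title(NEW_TITLE_REGEX, html)
    print(f'{html!r}: old={old!r} new={new!r}')
```

On these exact strings the old pattern captures "</title" for the empty tag and fails to match the single-character title, while the new pattern returns None for the empty tag and "A" for the single-character title. Inputs with text after the closing tag are not covered by this sketch.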