An export of one of the discussion threads (page ID 803932 in huwiki_p) contains invalid UTF-8: the thread poster's signature appears to be truncated in the middle of a multi-byte character.
A hexdump of the exported page shows:
00000be0 74 3b 67 72 65 65 6e 26 71 75 6f 74 3b 20 66 61 |t;green&quot; fa|
00000bf0 63 65 3d 26 71 75 6f 74 3b 4c 75 63 69 64 61 20 |ce=&quot;Lucida |
00000c00 63 61 6c 6c 69 67 72 61 70 68 79 26 71 75 6f 74 |calligraphy&quot|
00000c10 3b 26 67 74 3b ce 93 ce bf cf 85 ce b2 ce b2 ce |;&gt;...........|
00000c20 bf cf 82 20 ce 98 ce b9 ce bb ce bf ce 3c 2f 54 |... .........</T|
00000c30 68 72 65 61 64 53 69 67 6e 61 74 75 72 65 3e 0a |hreadSignature>.|
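The 0xCE byte at offset 0x00000c2c is the lead byte of a two-byte UTF-8 sequence, so it must be followed by a continuation byte (0x80-0xBF). Instead it is followed directly by 0x3C, the "<" of the closing </ThreadSignature> tag.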
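A minimal Python sketch of how such a dangling lead byte can be located automatically. The helper name find_invalid_utf8 and the constants are hypothetical, not part of MediaWiki or pywikipedia. The sample bytes are copied from the hexdump rows at 0xc10 and 0xc20. The helper relies only on the standard UnicodeDecodeError attributes (start, end, reason).

```python
# Hypothetical helper for illustration; not part of MediaWiki or pywikipedia.
def find_invalid_utf8(data: bytes):
    """Return (offset, reason) for every spot where data is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            problems.append((pos + err.start, err.reason))
            pos += err.end  # resume right after the offending byte(s)
    return problems


# The two hexdump rows that hold the signature text (offsets 0xc10 and 0xc20).
BASE_OFFSET = 0x00000C10
SAMPLE = bytes.fromhex(
    "3b 26 67 74 3b ce 93 ce bf cf 85 ce b2 ce b2 ce "
    "bf cf 82 20 ce 98 ce b9 ce bb ce bf ce 3c 2f 54"
)

if __name__ == "__main__":
    for offset, reason in find_invalid_utf8(SAMPLE):
        # Print the absolute file offset of each undecodable byte.
        print(f"0x{BASE_OFFSET + offset:08x}: {reason}")
```

On the sample above, the only byte reported is the 0xCE immediately before "</T". All earlier 0xCE and 0xCF bytes are followed by a valid continuation byte.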
The XML dump process fails silently when it reaches this page. In these dumps:
http://download.wikimedia.org/huwiki/20110531/huwiki-20110531-pages-articles.xml.bz2
http://download.wikimedia.org/huwiki/20110614/huwiki-20110614-pages-articles.xml.bz2
the last page is page ID 803931. No XML follows it (not even the closing tags), so the whole dump is not well-formed XML.
The truncated output still gets compressed with bzip2, though.
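Because bzip2 will compress a truncated file without complaint, the archive has to be parsed to notice the problem. The following Python sketch, which uses only the standard library, is one possible way to do that. The function names check_dump and local_name are hypothetical. It assumes the usual export layout in which each <page> element has a direct <id> child.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_dump(stream):
    """Return (well_formed, last_page_id) for an XML export in a binary stream."""
    last_page_id = None
    try:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) == "page":
                # Assumption: the page ID is a direct <id> child of <page>.
                for child in elem:
                    if local_name(child.tag) == "id":
                        last_page_id = int(child.text)
                        break
                elem.clear()  # drop the page's content to limit memory use
    except ET.ParseError:
        # Truncated or otherwise malformed XML (including invalid UTF-8).
        return False, last_page_id
    return True, last_page_id


if __name__ == "__main__":
    import sys

    # Usage: python check_dump.py some-dump.xml.bz2
    with bz2.open(sys.argv[1], "rb") as dump:
        ok, last_id = check_dump(dump)
    print("well-formed" if ok else "NOT well-formed", "- last page ID:", last_id)
```

For a dump cut off as described, a check like this would be expected to report the file as not well-formed, together with the ID of the last complete page.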
This problem was reported on the pywikipedia mailing list by Bináris:
http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/11335
Version: unspecified
Severity: major
URL: https://hu.wikipedia.org/wiki/Speci%C3%A1lis:Lapok_export%C3%A1l%C3%A1sa/T%C3%A9ma:Szerkeszt%C5%91vita:Dencey/F%C3%B6l%C3%B6sleges_inform%C3%A1ci%C3%B3k/v%C3%A1lasz_%283%29
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47885