calmingshoggoth (calmingshoggoth) wrote in ljdump,

I had some problems with ljdump version 1.5.1 and fixed them

The first problem was that LJ limits you to 1000 fetches per hour. I made the loop sleep for four seconds between fetches (60*60/1000 is 3.6, so I rounded up), and it hasn't hit the limit since.
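The arithmetic behind the four-second pause can be sketched on its own. This is only an illustration, not ljdump code: `min_delay_seconds` and `fetch_all` are hypothetical names, and the 1000-per-hour figure is the limit described above.

```python
import math
import time

FETCH_LIMIT_PER_HOUR = 1000  # the limit described in this post


def min_delay_seconds(limit_per_hour):
    """Whole seconds to wait between fetches so the hourly limit is not exceeded."""
    # 3600 / 1000 = 3.6, rounded up to 4
    return int(math.ceil(3600.0 / limit_per_hour))


def fetch_all(items, fetch, delay=None, sleep=time.sleep):
    """Call fetch(item) for every item, pausing after each call.

    `fetch` and `sleep` are passed in so the loop can be tried without
    touching the network or actually waiting (hypothetical helper).
    """
    if delay is None:
        delay = min_delay_seconds(FETCH_LIMIT_PER_HOUR)
    results = []
    for item in items:
        results.append(fetch(item))
        sleep(delay)
    return results
```

At four seconds per fetch the loop makes at most 900 requests per hour, which leaves some margin under the limit.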

Then I ran into a problem with comments. The first comment id in the journal I am backing up is in the mid four thousands. When ljdump fetched with a start id of 1, it got back an empty set of comments (<comments></comments>). It then looped endlessly because maxid never changed. I changed the value in the .last file to 4000 and it fetched the first thousand or so comments, but then there was another big gap in the comment ids (they jumped up to seven thousand something) and it got stuck in an infinite loop again.
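Here is a toy model of how the loop stalls. It is a sketch under a loud assumption: `fake_fetch` pretends the server only returns comments whose ids fall in a fixed window starting at the requested id. That matches what I saw, but I have not confirmed it is how LJ actually behaves. The ids, the window size, and the function names are all made up for illustration.

```python
WINDOW = 1000  # assumed size of the id range the server looks at per request


def fake_fetch(existing_ids, startid, window=WINDOW):
    """Stand-in for the server: return ids in [startid, startid + window)."""
    return [i for i in sorted(existing_ids) if startid <= i < startid + window]


def naive_sync(existing_ids, target_maxid, start=0, max_rounds=5):
    """Advance maxid only from what comes back, like the unpatched loop.

    Returns (maxid, rounds used). The real loop has no round cap, so a
    stall here corresponds to an infinite loop there.
    """
    maxid = start
    for rounds in range(1, max_rounds + 1):
        batch = fake_fetch(existing_ids, maxid + 1)
        if batch:
            maxid = max(maxid, max(batch))
        if maxid >= target_maxid:
            return maxid, rounds
    return maxid, max_rounds


ids = {4500, 4600, 7100}
print(naive_sync(ids, 7100))              # (0, 5): never gets past the first gap
print(naive_sync(ids, 7100, start=4000))  # (4600, 5): stuck at the second gap
```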

Looking at the code, I noticed that all of the ids are present in the comment.meta file (held in metacache in the code), so I changed the code to grovel through that data structure instead of just blindly using maxid + 1 as the next id.
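The idea of picking the next id from the known ids can be shown in isolation. This is a simplified sketch rather than the patched function itself: `next_comment_id` is a hypothetical name, it returns None when nothing is left instead of echoing the old value, and the ids are invented.

```python
def next_comment_id(known_ids, last_id):
    """Return the smallest known id greater than last_id, or None if none is left."""
    candidates = [i for i in known_ids if i > last_id]
    return min(candidates) if candidates else None


# Invented ids with gaps like the ones described above.
known_ids = {4500: "meta", 4501: "meta", 5400: "meta", 7100: "meta"}

print(next_comment_id(known_ids, 0))     # 4500: skips the empty range below 4500
print(next_comment_id(known_ids, 5400))  # 7100: jumps straight over the second gap
print(next_comment_id(known_ids, 7100))  # None: nothing left to fetch
```

Because the start id always comes from an id that is known to exist, a gap of any size costs no extra requests.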

Here is the diff containing my changes:



--- ljdump.py 2010-12-28 18:14:40.000000000 -0500
+++ ljdump.py.new 2017-01-04 06:56:31.000000000 -0500
@@ -24,7 +24,7 @@
 #
 # Copyright (c) 2005-2010 Greg Hewgill and contributors
 
-import codecs, os, pickle, pprint, re, shutil, sys, urllib2, xml.dom.minidom, xmlrpclib
+import codecs, os, pickle, pprint, re, shutil, sys, urllib2, xml.dom.minidom, xmlrpclib, time
 from xml.sax import saxutils
 
 MimeExtensions = {
@@ -39,6 +39,12 @@
     import md5 as _md5
     md5 = _md5.new
 
+def minid(metacache, old_max):
+    try:
+        return min(x for x in metacache.keys() if x > old_max)
+    except ValueError:  # min() raises ValueError when no larger id is left
+        return old_max
+
 def calcchallenge(challenge, password):
     return md5(challenge+md5(password).hexdigest()).hexdigest()
 
@@ -197,6 +203,7 @@
                     print "Error getting item: %s" % item['item']
                     pprint.pprint(x)
                     errors += 1
+                time.sleep(4)
             lastsync = item['time']
             writelast(Journal, lastsync, lastmaxid)
 
@@ -277,6 +284,10 @@
     newmaxid = maxid
     maxid = lastmaxid
     while True:
+        maxid = minid(metacache, maxid) - 1 # has to be minus one because the rest assumes plus one
+        if maxid == lastmaxid:
+            break #no more ids in the metacache
+
         try:
             try:
                 r = urllib2.urlopen(urllib2.Request(Server+"/export_comments.bml?get=comment_body&startid=%d%s" % (maxid+1, authas), headers = {'Cookie': "ljsession="+ljsession}))
@@ -312,6 +323,7 @@
             if found:
                 print "Warning: downloaded duplicate comment id %d in jitemid %s" % (id, jitemid)
             else:
+                print "Writing comment id %d from %s" % (id, comment['date'])
                 entry.documentElement.appendChild(createxml(entry, "comment", comment))
                 f = codecs.open("%s/C-%s" % (Journal, jitemid), "w", "UTF-8")
                 entry.writexml(f)

