Thursday, June 7, 2007

OlinDocs database generation (part 2)

OK so back to some Python. Here's the general setup I have (if you're just tuning in, you'll want to start here):

I have every document in a series of directories such that its location is ...\semester\class\prof\author.

Single Directory Database

This is mostly what we did yesterday (underscores represent tabbing b/c I seem to be having some trouble putting tabs in right in Blogger. Stupid HTML):

def prelim(location):
_import os

_# the last four pieces of the location are the metadata
_params = location.split('/')
_sem=params[-4]
_class_=params[-3]
_prof=params[-2]
_author=params[-1]

_db = open( location+'/part_db.txt', 'w' )

_for fileName in os.listdir (location):
__# don't list the database file in the database
__if fileName=='part_db.txt':
___continue

__temp = fileName.split('.')
__file_name = temp[0]
__file_type = temp[1]

__...more stuff...
__db.write('\t')
__db.write('\t')
__db.write('{a href="http://olindocs.com/')
__db.write(fileName)
__db.write('"}')
__db.write(file_type)
__db.write('{/a}')
__...more stuff...
_db.close()

path_file=open('/.../OlinDocs/path.txt','r')
path=path_file.read().replace('C:','').replace('\\','/').split('\n')
for location in path:
_prelim(location)

First it imports the module for looking at directories. Then it pulls out all the metadata that we previously embedded in the location by splitting the location at every /, and I assign each piece to the variable it defines. You'll notice that I use negative indices so that I don't have to worry about how many directories deep I'm looking. This works just as well from C:\\ as it does from C:\\~\~\~\~\~\ (that's actually a little bit of a lie b/c Windows sucks, but we'll get to that later). I have a little if statement to skip adding part_db.txt to the database; this is the file that I'm using to build up the database and not something that needs to be on OlinDocs. What goes on next is just writing out a text file. I do end up doing some cool things like putting some HTML in the text file (I have <> instead of {} in the real program) so that the metadata can include links. Then we close the file and we're done. We have a database that describes all of the documents in the directory. It's actually not in its final format because I just wrote the data for each file on a new line instead of doing the one line of names and one line of metadata thing. This is really easy to change later so I'm just keeping this step simple.
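If you want to poke at the negative-index trick on its own, here's a minimal sketch (written with real indentation, so it won't paste nicely into Blogger either). The names parse_location and split_file_name and the example path pieces (sp07, ModSim, Smith, jdoe) are made up for illustration and aren't part of the real program. I'm also assuming it's fine to use os.path.splitext instead of split('.'), which avoids tripping on file names with no dot or more than one dot.

```python
import os.path


def parse_location(location):
    """Pull semester/class/prof/author off the END of a directory path."""
    # Use forward slashes everywhere and drop any trailing slash, so the
    # last piece is never an empty string.
    parts = location.replace('\\', '/').rstrip('/').split('/')
    if len(parts) < 4:
        raise ValueError('expected .../semester/class/prof/author: %r' % location)
    # Negative indexing: take the last four pieces, however deep the path is.
    sem, class_, prof, author = parts[-4:]
    return {'sem': sem, 'class': class_, 'prof': prof, 'author': author}


def split_file_name(file_name):
    """Split 'report.pdf' into ('report', 'pdf'); extension may be empty."""
    base, ext = os.path.splitext(file_name)
    return base, ext.lstrip('.')


if __name__ == '__main__':
    # Hypothetical paths: same metadata, different depths.
    print(parse_location('C:\\docs\\sp07\\ModSim\\Smith\\jdoe'))
    print(parse_location('/a/b/c/sp07/ModSim/Smith/jdoe/'))
    print(split_file_name('report.final.pdf'))
```

The nice part is that nothing in parse_location cares what comes before the last four directories, which is the whole reason for counting from the end.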

Cool. Now all of our directories have a file called part_db.txt that tells us about the documents in it, but it's in the wrong format and scattered everywhere. This is getting long, so we'll merge these databases and put them in the right format tomorrow.
