Monday, March 24, 2008

The Script

$dqlrun = $args[0]

# Build the DQL batch: the statement itself, then "go" to execute it
$SelectStatement = @()
$SelectStatement += "$dqlrun"
$SelectStatement += "go"

# Pipe the batch through idql32.exe and filter out the "rows affected" footer
$presult = $SelectStatement | [path to idql32.exe] [repository name, credentials, etc.] | where-object {$_ -notlike "*row affected*"} | where-object {$_ -notlike "*rows affected*"}

$result = @()

# The second line of the output ($presult[1]) is the row of dashes under the
# headers; splitting it on spaces gives each column's width
$split = @($presult[1].split(" ",[StringSplitOptions]::RemoveEmptyEntries))
$headers = @()
$numbers = @()
$start = 0
for ($i = 0; $i -lt $split.length; $i++)
{
    $length = $split[$i].length
    # Read the column name from the header line at the same offset and width
    $column = $presult[0].substring($start,$length)
    $column = $column.trimend()
    $headers += $column
    # Remember the (offset, width) pair for this column
    $numbers += @($start,$length)
    # Columns are separated by two spaces
    $start = $start + $length + 2
}

# Everything after the dashes is a data row; slice each one up by the saved
# offsets and attach the pieces as properties on a custom object
for ($a = 2; $a -lt $presult.length; $a++)
{
    $listitem = new-object psobject
    $f = 0
    for ($i = 0; $i -lt $headers.length; $i++)
    {
        $value = $presult[$a].substring($numbers[$f],$numbers[$f+1])
        $value = $value.trimend()
        $listitem | add-member -membertype noteproperty -name $headers[$i] -value $value
        $f = $f + 2
    }
    $result += $listitem
}
write-output $result

Output from this script should be a PowerShell object with the results of a Documentum Select statement that you pass it.
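
For example, if you save the script as y:\scripts\Get-DocumentumObject.ps1 (the name and path come from The Object post below), a quick test looks like this - user_name is a standard attribute of dm_user:
$users = y:\scripts\Get-DocumentumObject.ps1 "select user_name from dm_user"
$users | format-table user_name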

Saturday, March 22, 2008

Reports-Duplicate Files

A Documentum folder is not like your grandmother's network share. No, no, you can have duplicate filenames in a folder. That's sometimes handy, because sometimes you have two bits of data that you want to store separately. But if your users are making mistakes and checking in documents as new documents with duplicate names, well, you're in trouble.

This was the trickiest report to come up with, but the second shortest. That's because you can use just one statement - that's right, just one select statement - to return all the data you need. Operators are standing by... no, that's not right. My blog is not a gimmicky way to make money. It's something I'm writing down in case I ever go crazy and someone needs to fake being me for a while. Here goes:
select i_folder_id, object_name, a_content_type, count(*) from dm_document where folder ('/Cabinet', descend) group by i_folder_id, object_name, a_content_type having count(*) > 1
Wha...???
Because I'm not a SQL guy (not even close), it took about 20 hours just to construct and get this one command right. But it's genius (if I do say so myself). I'll unpack it a bit:
i_folder_id is the object id of the folder containing the document (there's no telling what folder this is without another DQL statement - thanks, DQL), a_content_type is the metadata holding the file type (referred to in my last post), and count(*) - well, I don't really understand exactly what it means, but it has something to do with duplicates.

As I understand it, the group by near the end gathers rows into groups that share the same folder, filename, and file type, and count(*) counts how many rows land in each group. Then, the having bit at the end says, "Just show me the groups with more than one entry" - that is, duplicates within their own folders with the same file type and filename. If you enter this statement and return a PowerShell object, you should have no problem creating a report in Excel.
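
As a sketch of how that might look end to end, using the Get-DocumentumObject.ps1 script from The Object post below (the paths are just my conventions):
$dql = "select i_folder_id, object_name, a_content_type, count(*) from dm_document where folder ('/Cabinet', descend) group by i_folder_id, object_name, a_content_type having count(*) > 1"
$dupes = y:\scripts\Get-DocumentumObject.ps1 $dql
$dupes | sort-object i_folder_id | format-table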

Reports-Naming Consistency

One of the cool things I am able to do with custom objects is create reports based on, well, whatever my mind can come up with. The powers that be came up with 4 reports that we might want to run on a regular basis: Naming Consistency, Subfolder Creation, Duplicate Files, and Document Volume. The first three are not only metrics we can use to see how well people are following instructions (in case you were wondering, they were only doing OK), but also tools the users themselves can use to "fix" their problems. The Document Volume report is mostly just for management, to see what kind of volume of documents people are pushing.

The most difficult of these is the Naming Consistency report. We have an extensive naming convention document that must be followed if we are to push documents to websites (this is just until workflow is up and running and naming our mission-critical documents for us). I had to learn how to use .NET regular expressions (which are kind of complicated) to compare the names of documents to our standards, then, if a name failed, figure out why it failed and put that on a report in Excel.

Here is an example of a regular expression I used for matching correspondence:
$CorrMatch = [regex] "^[a-z0-9]+\-((20[0-9]{2})|(19[0-9]{2,2}))\-[0-1][0-9]\-[0-3][0-9]\-[\s\S]*$"
If you can follow that without having seen regular expressions, you're a better man (or woman) than I. Basically, $CorrMatch is an object of the type, "regex." The caret signifies the beginning of a filename, the dollar sign the end. The brackets enclose sets of characters to match against, the curly braces signify the number of times to repeat a given set. The pipe indicates OR, the backslashes come before a symbol I actually want to search for in the name. The plus sign indicates one or more matches, the asterisk zero or more matches, and \S and \s are types of characters to search for (non-whitespace and whitespace, so together in brackets they match anything). Phew.

With this set up, a match for a filename becomes pretty simple:
if ($DctmObject.object_name -match $CorrMatch) {DoStuff}
An example of something that would fit this regular expression is this filename:
230-2008-04-03-Complaint
Documentum doesn't use extensions, as the file format is stored as metadata on the object.
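
To see the pattern in action, here are two quick checks (the second filename is made up to show a failure):
"230-2008-04-03-Complaint" -match $CorrMatch
"Complaint-230" -match $CorrMatch
The first returns True; the second returns False, because there is no valid date after the first hyphen. Note that -match is case-insensitive by default, so [a-z0-9] accepts capital letters too.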

The way I ran this report before was rather brutish. I would go into each of a person's folders, look at each of the documents I wanted to check, and compare their names against the regular expressions I set up. If a name failed, I would use more regular expressions to figure out why it failed, and put it into an Excel spreadsheet. One of the problems with this is that people inherit other people's mistakes.

The better way I found for doing this was to load all of the misnamed files into an object, then create a list of unique names of people using Get-Unique, and finally create worksheets for each person and load their misnamed files. This has the advantage of doing all of the Documentum work first, then the matching, and only then invoking Excel to put the data in.
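
Here is a minimal sketch of that flow. $BadFiles and the way it was filled are assumptions; owner_name is a standard attribute on dm_document:
$Owners = $BadFiles | foreach-object {$_.owner_name} | sort-object | get-unique
foreach ($Owner in $Owners)
{
    $Theirs = $BadFiles | where-object {$_.owner_name -eq $Owner}
    # ...create a worksheet named for $Owner and load $Theirs into it...
}
Get-Unique only removes adjacent duplicates, which is why the names get sorted first.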

One modification I had to make to get this to work properly was to extend the Folder object into a custom Folder type. We use the Desktop Client (because there was no way we were going to get our employees to adapt to the Webtop Client), so I created a custom type with account and management information on it, then set up a new tab in Documentum to display this new information. This has two primary benefits:
1. No matter where account folders go, I can always find them without knowing where they are (see the sketch after this list)
2. People in different offices can filter out the buildings that don't pertain to them
A couple other benefits will come in later, when I customize InputAccel to use this new information (it will cut down on the amount of verification that IA needs to use).
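
As a sketch of benefit 1, a search by type rather than path could be as simple as this (my_account_folder and the building attribute are made-up names for illustration, not our actual custom type):
$dql = "select r_object_id, object_name from my_account_folder where building = 'Chicago'"
$folders = y:\scripts\Get-DocumentumObject.ps1 $dql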

I think a faster way to do this would be to save the data as CSV, which PowerShell can do natively. This has the advantage of bypassing Excel entirely. Controlling Excel through its COM object is slow (but very cool, nonetheless). I will post later on how this works.
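
For instance, with the $result object from The Script post above (the report path is just an example):
$result | export-csv y:\reports\NamingConsistency.csv -notypeinformation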

I will post a little later on the other three reports. The cool thing about this report was the regex and the custom folder type - the rest was just puzzle work.

The Object

One of the difficult things about object-oriented programming for a non-object-oriented programmer is the whole object thing. What is an object? I knew a little just from having looked at this kind of stuff, but I didn't really know what was going on. This was to be my first foray from procedural programming into OO. I knew if I misstepped, it could be deadly (well, not really, but at least not very fun).

My first trick was to pass objects within PowerShell, rather than writing to a text file and then reading that text file back. Reducing disk I/Os would improve the speed of the script and decrease the number of errors if I were to, say, run four scripts at the same time. This is pretty easily accomplished using an array of strings for DQL commands:

$DQLStatement = @()
$DQLStatement += "select * from dm_user"
$DQLStatement += "go"
$DQLStatement | idql32.exe

That doesn't use any disk I/O (except for loading idql32.exe from the disk, which I haven't learned to get around yet). But it doesn't get us any closer to the ideal of using objects from Documentum, which is one of the advantages of Documentum in the first place. So, I wrote what I consider to be one of my crowning achievements at the company: Get-DocumentumObject.ps1. This script takes the output from idql32.exe, looks at the headers and column widths, and returns an object based on what it receives. It is much more elegant than what I was doing before. The first version I wrote returned the data to an XML file, but now it returns it as an object right in PowerShell, using the custom PSObject type. I will share it in a later post.

One of the main advantages of this is it only requires one line of code to return a Documentum object from a select statement:
$DctmObject = "select * from dm_user" y:\scripts\Get-DocumentumObject.ps1
At this point, $DctmObject would be a PowerShell object with all the information from the select statement.

A caveat here about this method: it is far from perfect. If the return from the select statement is wider than a certain length, for some reason PowerShell wraps the text to the next line, causing the object parsing to break. I may someday find a fix for this, but most of the time I don't need to use 40 columns, just a few.