30 January 2013

How To: extract from an undocumented database

Here's the problem which you encounter:
There's a database in your company that you need to extract data from. They  want to make that database redundant (and everybody mumbles around you "it's about time!") and you, as the database-know-how, required to create a report with the required fields.
I've done this daunting task twice: first, for a front end that run on Win 95 (in 2008!) That database was on a Syabase database, and believe me, that line isn't even on my CV. The second time I run it from MS-Access, and that was a way more pleasant experience. Those database had couple of things in common, and the most obvious one that they were shelf products, not custom made, and therefor had many tables that weren't actually in use.
And these are the steps I took:
First, ask your Business Analyst (or whomever had given you this task) which fields they need. They can usually give you at least one line of data, some names, etc, which could give you your reference point for investigation.
The second stage for me was to migrate the data to MS-SQL. This stage isn't strictly necessary; after all, if you can query one database you can query them all. But using my own local version has the benefit of (a) there's not a chance I'll mess up the data and (b) I'm using the SQL version I know well with its new features, and that enables me to focus on the problem, not on the differences between different SQL versions.
Now it's time for querying, but before that, a little script could help.
How do you get the size of all the tables in a database?
SELECT 
    t.NAME AS TableName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 AS TotalSpaceKB, 
    SUM(a.used_pages) * 8 AS UsedSpaceKB, 
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnuasedSpaceKB
FROM 
    sys.tables t
INNER JOIN      
    sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN 
    sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN 
    sys.allocation_units a ON p.partition_id = a.container_id
WHERE 
    t.NAME NOT LIKE 'dt%' 
    AND t.is_ms_shipped = 0
    AND i.OBJECT_ID > 255 
GROUP BY 
    t.Name, p.Rows
ORDER BY 
    t.Name
The benefit of this script is clear: in a shelf database there could be heaps of tables that are not even in use. When you start investigating you don't need those. From my experience, start with the second largest table. The largest one is usually a log, and doesn't usually contain any necessary information. Of course you can ignore all the ones that have 0 data or only one line...

And now for the investigation itself. Look up that name. Look up those details. Try to locate foreign keys and indexes. That database was written by a human who has logic: how does that person's logic works?
Sure enough, you'll be able to present some results for your Business Analyst. They will probably be very happy with your results, and only have some comments. Modify the original script, and believe me, they'll be happy. Very happy, even.

No comments:

Post a Comment