Tuesday, March 20, 2012

Compression

Hi everyone,
I am developing a web app and wanting to store documents (doc, xls,
pdf, ppt...etc.) within the DB (MSSQL). Is it advisable in general (I
don't know exactly how big the docs are going to be) to compress the
files before putting them into the DB? How does this affect Full Text
Searching? On a wider note what's the attitude of people in terms of
storing documents in the DB versus on the file system. There seems to
be a lot of differing opinions. Any links to resources would be most
appreciated.
Thanks for your help in advance.
Jose
Jose,
First of all, can I assume that you are using SQL Server 2000? If so, could
you post the full output of: SELECT @.@.version -- as this is very help info
in understanding SQL FTS issues.
Secondly, what exactly do you mean by "compress[ing] the files before
putting them into the DB?" Do you mean to store them as ZIP files in columns
defined with the IMAGE datatype? or are you thinking about placing the FT
Catalog folder (SQL00060005) & files on a compressed drive. As for the
latter, I've not done any testing with compressed drive, but Hilary Cotter
has and from past postings on this subject, he's indicated that the
performance gains is mimumal. As for the former, you will need to purchase a
3rd Party IFilter for the WinZIP ZIP file format and then assuming you are
using SQL Server 2000, you can store the zip files in an IMAGE datatype and
define a "file extension" column that will have the value of "zip" and the
ZIP IFilter will be called by the MSSearch and the mssdmn (Search Filter
Daemon) processes to FT Index the contents of the zip file that contains MS
Word documents. You can purchase the ZIP IFilter from either of the
following two vendors:
http://www.4-share.com/content/products.htm
http://www.ifiltershop.com/zipfilter.html
I've personally have been using the 4-Share ZIP IFilter for over a year and
find it works very well for me.
As for "storing documents in the DB versus on the file system", this is very
much a open and very actively discussed subject, below is one past posting
I've made on this subject with links included...
"There are many issues related to storing the files on disk (faster access,
easy copy/move, etc.) vs. storing the files in the database (transaction
control, change control, audit of changes, etc) than just efficiency and
scalability, although, those are important points... There is a ASPFAQ
website that has a number of references
(http://www.aspfaq.com/show.asp?id=2149) and links to KB articles (I've not
checked them all), but one major reason for storing the files in a SQL
Server table's column defined with an IMAGE datatype, is that in SQL Server
2000 you can take advantage of the new Full-Text Search (FTS) feature to FT
Index the contents of supported file types, primarily MS Office, HTML and
3rd party IFilter's like Adobe's PDF files.
As this is one of those FAQ type questions that have been known to start
religious wars or at the very least a flame war, and while Sharon didn't ask
about other application issues that this question often generates as it is
often an open-ended question, i.e., one that never seems to be answered to
everyone's satisfactions as it usually "depends" upon the application or
upon how one defines the word "best"...
As for the Terra Server, checkout the "about page"
(http://terraserver-usa.com/about.aspx?n=AboutBody) on the site that
explains how MS and others did it and yes, they used the IMAGE datatype,
specifically
http://terraserver-usa.com/about.asp...utTechDbschema (The Imagery table
contains the "blob" [image] field where the imagery data is stored.
Jose, depending upon how many files you have, how frequently they change,
how large they are, what the app is, etc. you may want to review the Terra
Server web site and consider the following rule of thumb that they used:
< 1 million images or big images (> 1MB) put them in the file system.
> 1 million and < 1 MB images, put them in SQL Server.
Note, you can also use "Text-in-Row" if the files are >7000 bytes on avg.
For everything in between, either way will work, depending upon if you need
transactional control over your files. Additionally, if you store the files
in a TEXT or IMAGE column, you can also store related metadata about that
file in SQL Server as well for increased searching capabilities. Also, and
obviously with SQL Server you get built-in support for validating the
consistency of the database, indices, backup, restore, etc. As for loading
&/or extracting the files from SQL Server there are now many methods of
doing this via BCP, BULK INSERT as well as ADO Stream DTS too, if you need
to transform your files in some manner.
Hope this helps, as the primary consideration should always be what is best
for your application...
Regards,
John
"Jose Perez" <jlpv@.totalise.co.uk> wrote in message
news:3724a9d9.0404141212.480e1630@.posting.google.c om...
> Hi everyone,
> I am developing a web app and wanting to store documents (doc, xls,
> pdf, ppt...etc.) within the DB (MSSQL). Is it advisable in general (I
> don't know exactly how big the docs are going to be) to compress the
> files before putting them into the DB? How does this affect Full Text
> Searching? On a wider note what's the attitude of people in terms of
> storing documents in the DB versus on the file system. There seems to
> be a lot of differing opinions. Any links to resources would be most
> appreciated.
> Thanks for your help in advance.
> Jose
|||John,
Thanks a lot for the detailed reply which you gave - it is most
appreciated. In answer to your question, yes I am running SQL Server
2000 (Version Information - Microsoft SQL Server 2000 - 8.00.760
(Intel X86) Dec 17 2002 14:22:05 Copyright (c) 1988-2003 Microsoft
Corporation Standard Edition on Windows NT 5.0 (Build 2195: Service
Pack 4)).
To clarify, I meant storing documents as "ZIP files in columns defined
with the IMAGE datatype". Thanks for the URLs regarding the ZIP
IFilter - I will investigate them. Your post was very helpful -
thanks.
Jose
"John Kane" <jt-kane@.comcast.net> wrote in message news:<#qtcv8pIEHA.964@.TK2MSFTNGP10.phx.gbl>...[vbcol=seagreen]
> Jose,
> First of all, can I assume that you are using SQL Server 2000? If so, could
> you post the full output of: SELECT @.@.version -- as this is very help info
> in understanding SQL FTS issues.
> Secondly, what exactly do you mean by "compress[ing] the files before
> putting them into the DB?" Do you mean to store them as ZIP files in columns
> defined with the IMAGE datatype? or are you thinking about placing the FT
> Catalog folder (SQL00060005) & files on a compressed drive. As for the
> latter, I've not done any testing with compressed drive, but Hilary Cotter
> has and from past postings on this subject, he's indicated that the
> performance gains is mimumal. As for the former, you will need to purchase a
> 3rd Party IFilter for the WinZIP ZIP file format and then assuming you are
> using SQL Server 2000, you can store the zip files in an IMAGE datatype and
> define a "file extension" column that will have the value of "zip" and the
> ZIP IFilter will be called by the MSSearch and the mssdmn (Search Filter
> Daemon) processes to FT Index the contents of the zip file that contains MS
> Word documents. You can purchase the ZIP IFilter from either of the
> following two vendors:
> http://www.4-share.com/content/products.htm
> http://www.ifiltershop.com/zipfilter.html
> I've personally have been using the 4-Share ZIP IFilter for over a year and
> find it works very well for me.
> As for "storing documents in the DB versus on the file system", this is very
> much a open and very actively discussed subject, below is one past posting
> I've made on this subject with links included...
> "There are many issues related to storing the files on disk (faster access,
> easy copy/move, etc.) vs. storing the files in the database (transaction
> control, change control, audit of changes, etc) than just efficiency and
> scalability, although, those are important points... There is a ASPFAQ
> website that has a number of references
> (http://www.aspfaq.com/show.asp?id=2149) and links to KB articles (I've not
> checked them all), but one major reason for storing the files in a SQL
> Server table's column defined with an IMAGE datatype, is that in SQL Server
> 2000 you can take advantage of the new Full-Text Search (FTS) feature to FT
> Index the contents of supported file types, primarily MS Office, HTML and
> 3rd party IFilter's like Adobe's PDF files.
> As this is one of those FAQ type questions that have been known to start
> religious wars or at the very least a flame war, and while Sharon didn't ask
> about other application issues that this question often generates as it is
> often an open-ended question, i.e., one that never seems to be answered to
> everyone's satisfactions as it usually "depends" upon the application or
> upon how one defines the word "best"...
> As for the Terra Server, checkout the "about page"
> (http://terraserver-usa.com/about.aspx?n=AboutBody) on the site that
> explains how MS and others did it and yes, they used the IMAGE datatype,
> specifically
> http://terraserver-usa.com/about.asp...utTechDbschema (The Imagery table
> contains the "blob" [image] field where the imagery data is stored.
> Jose, depending upon how many files you have, how frequently they change,
> how large they are, what the app is, etc. you may want to review the Terra
> Server web site and consider the following rule of thumb that they used:
> < 1 million images or big images (> 1MB) put them in the file system.
> Note, you can also use "Text-in-Row" if the files are >7000 bytes on avg.
> For everything in between, either way will work, depending upon if you need
> transactional control over your files. Additionally, if you store the files
> in a TEXT or IMAGE column, you can also store related metadata about that
> file in SQL Server as well for increased searching capabilities. Also, and
> obviously with SQL Server you get built-in support for validating the
> consistency of the database, indices, backup, restore, etc. As for loading
> &/or extracting the files from SQL Server there are now many methods of
> doing this via BCP, BULK INSERT as well as ADO Stream DTS too, if you need
> to transform your files in some manner.
> Hope this helps, as the primary consideration should always be what is best
> for your application...
> Regards,
> John
>
> "Jose Perez" <jlpv@.totalise.co.uk> wrote in message
> news:3724a9d9.0404141212.480e1630@.posting.google.c om...

No comments:

Post a Comment