Sunday, March 11, 2012

Complicated query with really big tables

Dear Experts,

I need to write a fairly complex query against some very large tables.

The scenario is the following:

I have four tables: #Products, #Shops, #Customers and #CollectedInformation.

#Products(ProductID int, ProductName varchar(10))

Product information: around 70,000 unique products.

#Shops(ShopID smallint, ShopName varchar(10))

Shop information: around 100 unique shops.

#Customers(CustomerID smallint, CustomerName varchar(10))

Customer information: around 100 unique customers.

#CollectedInformation(ProductID int, ShopID smallint, CustomerID int, SalesValue money)

Sales information collected into one huge table. We don't sell every product in every shop to every customer, so instead of the theoretical maximum of 70,000 products * 100 shops * 100 customers = 700 million records, the table will usually contain around 650 million.

What I would like is to create a temporary table covering all products, all shops and all customers, with the SalesValue obtained by joining #Products, #Customers and #Shops with #CollectedInformation.

In case a product, customer or shop combination doesn't have a value for SalesValue, I want it to appear in the final table with NULL as the SalesValue.

To achieve this I used a CTE in SQL Server 2005 to create a cartesian product of #Products, #Shops and #Customers, and then left outer joined it with #CollectedInformation.
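In skeleton form the approach looks like this (a simplified sketch; the alias all_combinations is illustrative, and the full test script follows below):

with all_combinations as
(
    select p.ProductID, s.ShopID, c.CustomerID
    from #Products p, #Shops s, #Customers c      -- cartesian product
)
select a.ProductID, a.ShopID, a.CustomerID, ci.SalesValue
from all_combinations a
left outer join #CollectedInformation ci
    on ci.ProductID = a.ProductID and ci.ShopID = a.ShopID and ci.CustomerID = a.CustomerID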

(The CTE query with all 70,000 products, all 100 customers and only 2 of the 100 shops takes around 1 minute 20 seconds on a 2-CPU 3.6 GHz Xeon server with 2 GB of RAM, connected to an HP MSA1000 SAN.)

Can you please help me determine whether there is a better way to do this? I haven't tried it with the full set, but I believe it will be very slow.

Following is an example of the code I used for testing:

declare @Customers int, @Shops int, @Products int
declare @counter int

-- row counts for the generated test data (full set commented out)
-- select @Customers = 100, @Shops = 100, @Products = 70000
-- select @Customers = 100, @Shops = 2, @Products = 70000
select @Customers = 100, @Shops = 2, @Products = 700

create table #Customers(CustomerID smallint, CustomerName varchar(10))

-- populate #Customers
set @counter = 1
while @counter <= @Customers
begin
    insert into #Customers values(@counter, 'Cn' + str(@counter, 8))
    set @counter = @counter + 1
end

create unique clustered index C1 on #Customers(CustomerID)

create table #Shops(ShopID smallint, ShopName varchar(10))

-- populate #Shops (ShopIDs come out as 102, 104, 106, ...)
set @counter = 1
while @counter <= @Shops
begin
    insert into #Shops values((@counter * 2) + 100, 's' + str(@counter, 9))
    set @counter = @counter + 1
end

create unique clustered index S1 on #Shops(ShopID)

create table #Products(ProductID int, ProductName varchar(10))

-- populate #Products
set @counter = 1
while @counter <= @Products
begin
    insert into #Products values(@counter, 'p' + str(@counter, 9))
    set @counter = @counter + 1
end

create unique clustered index P1 on #Products(ProductID)

create table #CollectedInformation(ProductID int, ShopID smallint, CustomerID int, SalesValue money)

-- fill #CollectedInformation with every product/customer pair for each shop
declare @TempShopID int

declare Shops_cursor cursor fast_forward for
    select ShopID from #Shops

open Shops_cursor
fetch next from Shops_cursor into @TempShopID

while @@fetch_status = 0
begin
    insert into #CollectedInformation (ShopID, ProductID, CustomerID)
    select @TempShopID, P.ProductID, C.CustomerID
    from #Customers C, #Products P

    fetch next from Shops_cursor into @TempShopID
end

close Shops_cursor
deallocate Shops_cursor
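-- (Side note: the cursor above could be replaced by one set-based insert;
--  an untested sketch:
--  insert into #CollectedInformation (ShopID, ProductID, CustomerID)
--  select s.ShopID, p.ProductID, c.CustomerID
--  from #Shops s, #Products p, #Customers c)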

create clustered index ci1 on #CollectedInformation (CustomerID)
create index ci2 on #CollectedInformation (ProductID)
create index ci3 on #CollectedInformation (ShopID)

-- give every row a pseudo-random SalesValue
declare @random money
set @random = rand()

update #CollectedInformation
set SalesValue = cast((10000 * @random) + (ProductID * @random) + (CustomerID * @random) + (ShopID * @random) as money)

-- Following is the code that creates the final temporary table:

declare @PeriodNo int, @Indicate char(1)
set @PeriodNo = 120
set @Indicate = 'A';

with results_cte (ProductID, ProductName, ShopID, ShopName, CustomerID, CustomerName)
as
(
    -- cartesian product of all products, shops and customers
    select p.ProductID, p.ProductName, s.ShopID, s.ShopName, c.CustomerID, c.CustomerName
    from #Products p, #Shops s, #Customers c
)
select @PeriodNo as Period, @Indicate as SpecialSymbol,
       r.ProductID, r.ProductName, r.CustomerID, r.CustomerName, r.ShopID, r.ShopName,
       ci.SalesValue
into #temp
from results_cte r
left outer join #CollectedInformation ci
    on ci.ProductID = r.ProductID and ci.ShopID = r.ShopID and ci.CustomerID = r.CustomerID
order by r.ProductID, r.CustomerID, r.ShopID

-- select top 100 * from #temp

drop table #Products
drop table #Shops
drop table #Customers
drop table #CollectedInformation
drop table #temp

What are you trying to achieve in the end? How is the temporary table going to be used?

If you want to report on this data, can I suggest you look at Analysis Services?

In your code, what takes the time? Be aware that you don't have an index on the CollectedInformation table.

In addition, think about what happens when you scale up to 650 million rows. Assuming 4 integer fields per row (16 bytes), that is roughly 650 million * 16 bytes, about 10 GB of data in one table, and you are planning on joining two sets of that size together, which is about 20 GB of data to process.

Any large processing of this nature is going to take time; more processing power and more memory will reduce it.
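If you do index it, note that the join compares all three key columns at once, so a single composite index covering the join keys may serve it better than separate single-column indexes. An untested sketch (the index name is illustrative):

-- one composite index matching the join keys, in place of
-- separate indexes on CustomerID, ProductID and ShopID
create clustered index ci_keys
on #CollectedInformation (ProductID, CustomerID, ShopID)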


Dear Simon,

The reason I need to create the temporary table is that I have to pass it through a CLR function which, after a lot of processing, exports the data to a text file. One line of the text file is built from many rows of the #temp table: shops become columns, so each line contains a product, a customer, SalesValueShop1, SalesValueShop2, and so on. (The CTE is executed and its result set is processed by the CLR.)
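To illustrate, with the two test shops (ShopIDs 102 and 104) one output line per product/customer pair is roughly the equivalent of this pivot (a simplified sketch, not the actual CLR code):

-- sketch: one row per product/customer, shops pivoted into columns
select ProductID, CustomerID,
       max(case when ShopID = 102 then SalesValue end) as SalesValueShop1,
       max(case when ShopID = 104 then SalesValue end) as SalesValueShop2
from #temp
group by ProductID, CustomerID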

Thanks for your advice on Analysis Services. I will take a look at it.

I do have three indexes on #CollectedInformation:

create clustered index ci1 on #CollectedInformation (CustomerID)
create index ci2 on #CollectedInformation (ProductID)
create index ci3 on #CollectedInformation (ShopID)

About the time and the size: the real table will contain at least 1 int, 1 smallint, 3 tinyint and 2 money fields (perhaps one of the money columns can be reduced to smallmoney), i.e. roughly 4 + 2 + 3 + 16 = 25 bytes per row before overhead. The amounts of data quoted are per period, and the text file may cover up to 100 periods, so I process the data period by period to keep the table sizes down.

Actually, I don't know whether the cartesian product that the CTE creates, left outer joined with #CollectedInformation, is the best way. What I want from the query is to join #CollectedInformation with #Products, #Customers and #Shops. Some products, customers and shops will have no rows in #CollectedInformation, but I still want them to appear in the final text file, with a * in place of the NULL SalesValue. For example, if for one product and one customer I only have sales in shops 1 and 2, the text file line would be: Product1, Customer1, Shop1SalesValue, Shop2SalesValue, *, * (no sales for shops 3 and 4). A sketch of the substitution follows below.
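Something like this would produce the * at the SQL level, reusing the results_cte definition from the script above (an untested sketch; in my setup the CLR does this substitution):

-- sketch: emit '*' in place of NULL sales values
select r.ProductID, r.CustomerID, r.ShopID,
       isnull(convert(varchar(21), ci.SalesValue), '*') as SalesValueText
from results_cte r
left outer join #CollectedInformation ci
    on ci.ProductID = r.ProductID and ci.ShopID = r.ShopID and ci.CustomerID = r.CustomerID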

Regards,

Spyros Christodoulou
