UNION, LIMIT and ORDER BY

Discussion:

(too old to reply)

Israel

2007-05-25 14:23:08 UTC

I've trying to write a query that seems like it should be simple but
for some reason my attempts are not working. This is really a general
SQL quesion and doesn't pertain to MySQL but I couldn't find a generic
database discussion group except on on advancement and theory and this
is really a basic query construction question.

Just say I have a table with three columns, name, date, score and
these represent the test scores for various people. Each person can
take the test as many times as they like but I only want to get back
the last result.

My first thought didn't seem right but I figured I'd try it anyway:
SELECT name, date, score FROM testresults
GROUP BY name
ORDER BY date DESC
And yes this is wrong and does not return anything useful since the
score returned seems to be based on where the grouping started in
someway.

My second thought was that the list of people is small and known so I
was going to just hardcode their names into the query:
SELECT name, date, score FROM testresults WHERE name = 'bob' ORDER BY
date DESC LIMIT 1
UNION ALL
SELECT name, date, score FROM testresults WHERE name = 'mary' ORDER BY
date DESC LIMIT 1
UNION ALL
SELECT name, date, score FROM testresults WHERE name = 'jim' ORDER BY
date DESC LIMIT 1
UNION ALL
This is syntactically incorrect.

Can anyone help me with this query?

strawberry

2007-05-25 15:03:31 UTC

Permalink

Post by Israel
I've trying to write a query that seems like it should be simple but
for some reason my attempts are not working. This is really a general
SQL quesion and doesn't pertain to MySQL but I couldn't find a generic
database discussion group except on on advancement and theory and this
is really a basic query construction question.
Just say I have a table with three columns, name, date, score and
these represent the test scores for various people. Each person can
take the test as many times as they like but I only want to get back
the last result.
SELECT name, date, score FROM testresults
GROUP BY name
ORDER BY date DESC
And yes this is wrong and does not return anything useful since the
score returned seems to be based on where the grouping started in
someway.
My second thought was that the list of people is small and known so I
SELECT name, date, score FROM testresults WHERE name = 'bob' ORDER BY
date DESC LIMIT 1
UNION ALL
SELECT name, date, score FROM testresults WHERE name = 'mary' ORDER BY
date DESC LIMIT 1
UNION ALL
SELECT name, date, score FROM testresults WHERE name = 'jim' ORDER BY
date DESC LIMIT 1
UNION ALL
This is syntactically incorrect.
Can anyone help me with this query?

SELECT t1.* FROM tests t1
LEFT JOIN tests t2 ON t1.name = t2.name
AND t1.date > t2.date
WHERE t2.date IS NULL;

Good Man

2007-05-28 17:35:39 UTC

Permalink

Ideally, you would have two tables here: one with people names (auto-
increment user id, user name) and one with test results (auto-increment
tableID, ID referencing the person, testdate, testresult). This would
make this query, and others you are likely to create afterwards,
incredibly easier (though this is untested):

SELECT t.date,t.score,u.name
FROM testresults t
JOIN users u ON t.userID=u.userID
GROUP BY t.userID
ORDER BY t.date DESC

The key to realizing there is a better way to design your database is
any replication of data: ie in your table, the name "Bob" is entered as
many times as he took the test. What if he called you up and said "Id
prefer my name be stored as Robert"? You'd have to go trawling through
this table, and any others, changing data when there is really no need
to do so - changing a single entry in the "users" table would do it.

If you don't want to change your table design, I would stick with your
query (minus the GROUP BY) and use a programming language (ie: PHP) to
display the results correctly.

Israel

2007-05-30 14:07:56 UTC

Permalink

Post by Good Man
The key to realizing there is a better way to design your database is
any replication of data: ie in your table, the name "Bob" is entered as
many times as he took the test. What if he called you up and said "Id
prefer my name be stored as Robert"? You'd have to go trawling through
this table, and any others, changing data when there is really no need
to do so - changing a single entry in the "users" table would do it.

My explanation of the tables was modified for brevity. There's
actually two tables (linked by FK) and the data is actually parameter
data for the state of some hardware components on a machine that get
logged at a regular interval but I wanted to create a query that would
just show me the latest for all of the components at arbitrary times.
This query would only list a handful of parameter types but there's on
the order of 100 parameter types all together and the parameter log
table keeps all of the history for post-analysis and can have on the
order of 100k-200k records so joining a table back onto itself is not
really option considering how slow it is.
Unless I'm missing something about the performance of joining in this
way I'll probably have to resort to multiple independent queries -
hence my original thought of using UNION ALL which I couldn't seem to
get the syntax correct for.
I was trying to avoid having to write any code because we already have
plans in the works for making a front end application that will allow
users to extract all sorts of data so I don't want to waste time
making a collection of hodge podge little apps or php code that
duplicates this work. Until we get the final application completed I
wanted to do everything via queries otherwise it will be like pulling
teeth to get users to stop using all of the "temporary" applications
but in the interim people can still get the data they need.

Israel

2007-05-30 14:35:22 UTC

Permalink

Post by Good Man
SELECT t.date,t.score,u.name
FROM testresults t
JOIN users u ON t.userID=u.userID
GROUP BY t.userID
ORDER BY t.date DESC

With the ORDER BY after the GROUP BY I just end up with the final
groups ordered by the date from some (seemingly random) row from each
group. What I need is for the first row picked for the group to be
the latest entry for that group but you can't switch the ORDER BY and
GROUP BY clauses.

strawberry

2007-05-30 16:18:13 UTC

Permalink

Post by Israel

Post by Good Man
SELECT t.date,t.score,u.name
FROM testresults t
JOIN users u ON t.userID=u.userID
GROUP BY t.userID
ORDER BY t.date DESC

I can't see what's wrong with the query I posted previously - but if
you provide us with an example dataset and the result you'd expect to
obtain from that, perhaps we can help. Ideally include the CREATE and
INSERT statements too.

Israel

2007-05-30 19:22:32 UTC

Permalink

Post by strawberry
I can't see what's wrong with the query I posted previously - but if
you provide us with an example dataset and the result you'd expect to
obtain from that, perhaps we can help. Ideally include the CREATE and
INSERT statements too.

The issue I have is that it performs a join onto itself and with
100k-200k records it is way to slow.

strawberry

2007-05-31 09:32:56 UTC

Permalink

Post by Israel

The issue I have is that it performs a join onto itself and with
100k-200k records it is way to slow.

Too slow? With appropriate indexes I'd imagine that it would take
about 5 seconds. Is that too slow? Is 'GROUP BY' quicker!?!.

Israel

2007-05-31 13:25:35 UTC

Permalink

Post by strawberry
Too slow? With appropriate indexes I'd imagine that it would take
about 5 seconds. Is that too slow? Is 'GROUP BY' quicker!?!.

Ok, so I must be missing something then. This is my sql to create the
tables:
CREATE TABLE `parameter` (
`Id` int(10) unsigned NOT NULL auto_increment,
`Name` varchar(45) NOT NULL default '',
PRIMARY KEY (`Id`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `parameterlog` (
`ParameterId` int(10) unsigned NOT NULL default '0',
`SampleTimestamp` double NOT NULL default '0',
`Data` varchar(50) NOT NULL default '',
KEY `FK_parameterdata_Parameter` (`ParameterId`),
CONSTRAINT `FK_parameterdata_Parameter` FOREIGN KEY (`ParameterId`)
REFERENCES `parameter` (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The FK_parameterdata_Parameter creates a BTREE index on the
ParameterId field so performing a join with that field with should as
quick as possible - right?

I ran the following query:
SELECT p1.* FROM `parameterlog` p1
LEFT JOIN `parameterlog` p2 ON p1.ParameterId = p2.ParameterId
AND p1.`Sampletimestamp` < p2.`Sampletimestamp`
WHERE p2.`Sampletimestamp` IS NULL;

On a smaller database with about 56,000 rows in this table it took 141
seconds and returned 73 rows. I didn't have the patience to wait for
the query for the larger databases but I can only assume that it has
an exponential growth.

I ran the following query (which doesn't produce the correct results):
SELECT * FROM parameterlog
GROUP BY ParameterId
ORDER BY SampleTimestamp
With the same record set of 56,000 rows this query executed in 0.21
seconds.

I'm using a 1.86GHz Pentium laptop with 1GB RAM and I'm running MySQL
server 4.1.

strawberry

2007-05-31 15:04:47 UTC

Permalink

Post by Israel

Post by strawberry
Too slow? With appropriate indexes I'd imagine that it would take
about 5 seconds. Is that too slow? Is 'GROUP BY' quicker!?!.

Ok, so I must be missing something then. This is my sql to create the
CREATE TABLE `parameter` (
`Id` int(10) unsigned NOT NULL auto_increment,
`Name` varchar(45) NOT NULL default '',
PRIMARY KEY (`Id`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `parameterlog` (
`ParameterId` int(10) unsigned NOT NULL default '0',
`SampleTimestamp` double NOT NULL default '0',
`Data` varchar(50) NOT NULL default '',
KEY `FK_parameterdata_Parameter` (`ParameterId`),
CONSTRAINT `FK_parameterdata_Parameter` FOREIGN KEY (`ParameterId`)
REFERENCES `parameter` (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The FK_parameterdata_Parameter creates a BTREE index on the
ParameterId field so performing a join with that field with should as
quick as possible - right?
SELECT p1.* FROM `parameterlog` p1
LEFT JOIN `parameterlog` p2 ON p1.ParameterId = p2.ParameterId
AND p1.`Sampletimestamp` < p2.`Sampletimestamp`
WHERE p2.`Sampletimestamp` IS NULL;
On a smaller database with about 56,000 rows in this table it took 141
seconds and returned 73 rows. I didn't have the patience to wait for
the query for the larger databases but I can only assume that it has
an exponential growth.
SELECT * FROM parameterlog
GROUP BY ParameterId
ORDER BY SampleTimestamp
With the same record set of 56,000 rows this query executed in 0.21
seconds.
I'm using a 1.86GHz Pentium laptop with 1GB RAM and I'm running MySQL
server 4.1.

I'm not really familiar with this approach - but what happens if you
put an index on sampletimestamp?

Israel

2007-05-31 20:55:01 UTC

Permalink

Post by strawberry
I'm not really familiar with this approach - but what happens if you
put an index on sampletimestamp?

Interestingly I tried that a few times and averaged 147 seconds and
then after I removed it the time averaged 154 seconds. The load on my
system is roughly the same as it was the first time I ran the test so
I have no clue why it changed but running the query does peg out my
CPU at 100% for a while.
Without more exhaustive testing I can't determine whether or not
putting an index on sampletimestamp had any appreciable effect.

Good Man

2007-06-01 14:47:06 UTC

Permalink

Post by Israel
The FK_parameterdata_Parameter creates a BTREE index on the
ParameterId field so performing a join with that field with should as
quick as possible - right?

yeah, but according to your query, you're also performing a join on
'Sampletimestamp'... so those should be indexed as well

Post by Israel
SELECT p1.* FROM `parameterlog` p1
LEFT JOIN `parameterlog` p2 ON p1.ParameterId = p2.ParameterId
AND p1.`Sampletimestamp` < p2.`Sampletimestamp`
WHERE p2.`Sampletimestamp` IS NULL;
SELECT * FROM parameterlog
GROUP BY ParameterId
ORDER BY SampleTimestamp
With the same record set of 56,000 rows this query executed in 0.21
seconds.

Of course it did. The query just says 'get everything' then group and order
by'. There's not much processing power/time in getting 'everything'.

Perhaps if you add an index on 'Sampletimestamp' things will happen much
quicker... have you looked at the EXPLAIN result of your query?