Postgres: ordering a table by an element in a large data set
I have a tricky problem: I am trying to find an efficient way of ordering a set of objects (~1000 rows) that each contain a large number (~5 million) of indexed data points. In my case I need a query that allows me to order the table by a specific data point. Each data point is a 16-bit unsigned integer.
I am currently solving this problem by using a large array:
Object table:
id serial NOT NULL,
category_id integer,
description text,
name character varying(255),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
data integer[]
GiST index:
CREATE INDEX object_rdtree_idx ON object USING gist (data gist__intbig_ops);
This index is not being used by my select query, and I am not sure it ever would be anyway.
Each day the array field is updated with a new set of ~5 million values.
I have a webserver that needs to list the objects ordered by the value of a particular data point:
Example query:
SELECT name, data[3916863] AS weight FROM object ORDER BY weight DESC
Currently, it takes about 2.5 seconds to perform this query.
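For reference (not part of the original question), a quick way to confirm where that time goes is to look at the plan; on a table like this it will typically show a sequential scan with most of the time spent fetching the large data arrays:

EXPLAIN (ANALYZE, BUFFERS)
SELECT name, data[3916863] AS weight
FROM object
ORDER BY weight DESC;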
My question is: is there a better approach? I am happy for the insertion side to be slow, as it happens in the background, but I need the select query to be as fast as possible. That said, there is a limit to how long the insertion can take.
I have considered creating a lookup table where every value has its own row, but I'm not sure how insertion/lookup time would be affected by that approach, and I suspect that entering 1000+ records with ~5 million data points each as individual rows would be too slow.
Currently, inserting a row takes ~30 seconds, which is acceptable for now.
Ultimately I am still on the hunt for a scalable solution to the base problem, but for now I just need this solution to work; it doesn't need to scale any further.
Update: I was wrong to dismiss having a giant table instead of an array; while insertion time increased massively, query time is reduced to a few milliseconds.
I am also altering my generation algorithm to only save a datum if it is non-zero and has changed from the previous update. This has reduced insertions to a few hundred thousand values, which takes only a few seconds.
New table:
CREATE TABLE data (
  object_id integer,
  data_index integer,
  value integer
);

CREATE INDEX index_data_on_data_index ON data USING btree (data_index);
New query:
SELECT name, coalesce(value, 0) AS weight
FROM objects
LEFT OUTER JOIN data ON data.object_id = objects.id AND data_index = 7731363
ORDER BY weight DESC
Insertion time: 15,000 records/second
Query time: 17 ms
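As a side note, COPY is typically the fastest way to load rows at this volume; a minimal sketch of a bulk-load path (my assumption, the question does not say how the rows are actually loaded):

-- one "object_id,data_index,value" line per datum, streamed from the generation script
COPY data (object_id, data_index, value) FROM STDIN (FORMAT csv);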
Answer:

First of all, do you really need a relational database for this? You do not seem to be relating some data to other data. You might be much better off with a flat-file format.
Secondly, your index on data is useless for the query you showed. You are querying by a position in the array, while the index is built on the values contained in the array. Dropping the index will also make your inserts considerably faster.
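Concretely, dropping it would just be (index name taken from the question):

DROP INDEX object_rdtree_idx;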
If you have to stay with PostgreSQL for other reasons (a bigger data model, MVCC, security) then I suggest you change your data model: make the data column of type bytea and ALTER COLUMN data SET STORAGE EXTERNAL. Since the data column is 4 x 5 million = 20MB it would be stored out-of-line anyway, but if you set it explicitly then you know what you have.
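A minimal sketch of that change, assuming the application repacks the integers into raw bytes itself (there is no direct cast from integer[] to bytea, so a new, hypothetically named column is shown rather than an in-place conversion):

ALTER TABLE object ADD COLUMN data_bytes bytea;                  -- hypothetical replacement for data integer[]
ALTER TABLE object ALTER COLUMN data_bytes SET STORAGE EXTERNAL; -- out-of-line, uncompressed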
Then create a custom function in C that fetches your data value "directly" using the PG_GETARG_BYTEA_P_SLICE() macro (I am not an accomplished PG C programmer, so forgive me any errors, but this should get you on your way):
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

// Function get_data_value() -- fetch one 4-byte value out of a bytea
// arg 0: bytea*  the data column
// arg 1: int32   position of the element in the data, 1-based
PG_FUNCTION_INFO_V1(get_data_value);

Datum
get_data_value(PG_FUNCTION_ARGS)
{
    int32  element = PG_GETARG_INT32(1) - 1;          // second argument, make it 0-based
    bytea *slice   = PG_GETARG_BYTEA_P_SLICE(0,       // first argument
                         element * sizeof(int32),     // byte offset into the data
                         sizeof(int32));              // we only need 4 bytes
    int32  value;

    memcpy(&value, VARDATA(slice), sizeof(int32));    // copy the value out of the slice
    PG_RETURN_INT32(value);
}
The PG_GETARG_BYTEA_P_SLICE() macro retrieves only a slice of the data from disk and is therefore very efficient.
There are samples of creating custom C functions in the docs.
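One step not shown above is registering the compiled function with SQL; a sketch, assuming the shared library is installed on the server's dynamic_library_path under the name get_data_value:

CREATE FUNCTION get_data_value(bytea, integer) RETURNS integer
    AS 'get_data_value', 'get_data_value'  -- shared library name, C symbol
    LANGUAGE C IMMUTABLE STRICT;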
Your query then becomes:
SELECT name, get_data_value(data, 3916863) AS weight FROM object ORDER BY weight DESC;