make sql.DB connection details configurable #2701

Conversation

james-lawrence
Contributor

@james-lawrence james-lawrence commented Apr 22, 2017

New default configuration behaviour; note that the pid, backend_start, and client_port do not change.

[telegraf]$ ./bin/telegraf --config telegraf.conf
2017-04-22T12:45:41Z I! Starting Telegraf (version dev-147-g3684ec63)
2017-04-22T12:45:41Z I! Loaded outputs: file
2017-04-22T12:45:41Z I! Loaded inputs: inputs.postgresql_extensible
2017-04-22T12:45:41Z I! Tags enabled: host=jambli
2017-04-22T12:45:41Z I! Agent Config: Interval:1s, Quiet:false, Hostname:"jambli", Flush Interval:4s 
2017-04-22T12:45:41Z I! Max Open 1
2017-04-22T12:45:41Z I! Max Idle 1
2017-04-22T12:45:41Z I! Max Lifetime {0s}
postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |          query_start          | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-------------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22007 |       10 | 2017-04-22 08:46:40.003514-04 | 2017-04-22 08:48:29.030619-04 | telegraf         | ::1         |                 |       33750
(1 row)

postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |          query_start          | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-------------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22007 |       10 | 2017-04-22 08:46:40.003514-04 | 2017-04-22 08:48:30.030588-04 | telegraf         | ::1         |                 |       33750
(1 row)

postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |         query_start         | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-----------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22007 |       10 | 2017-04-22 08:46:40.003514-04 | 2017-04-22 08:48:33.0307-04 | telegraf         | ::1         |                 |       33750
(1 row)

Setting max_idle to 0; note that all three change.

[telegraf]$ ./bin/telegraf --config telegraf-reconnect.conf
2017-04-22T12:51:19Z I! Starting Telegraf (version dev-147-g3684ec63)
2017-04-22T12:51:19Z I! Loaded outputs: file
2017-04-22T12:51:19Z I! Loaded inputs: inputs.postgresql_extensible
2017-04-22T12:51:19Z I! Tags enabled: host=jambli
2017-04-22T12:51:19Z I! Agent Config: Interval:1s, Quiet:false, Hostname:"jambli", Flush Interval:4s 
2017-04-22T12:51:19Z I! Max Open 1
2017-04-22T12:51:19Z I! Max Idle 0
2017-04-22T12:51:19Z I! Max Lifetime {0s}
postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |          query_start          | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-------------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22200 |       10 | 2017-04-22 08:50:05.047709-04 | 2017-04-22 08:50:05.054397-04 | telegraf         | ::1         |                 |       35120
(1 row)
postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |          query_start          | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-------------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22224 |       10 | 2017-04-22 08:50:08.038521-04 | 2017-04-22 08:50:08.041547-04 | telegraf         | ::1         |                 |       35186
(1 row)
postgres=# SELECT datid,datname,pid,usesysid,backend_start,query_start,application_name,client_addr,client_hostname,client_port FROM pg_stat_activity WHERE application_name = 'telegraf';
 datid | datname  |  pid  | usesysid |         backend_start         |          query_start          | application_name | client_addr | client_hostname | client_port 
-------+----------+-------+----------+-------------------------------+-------------------------------+------------------+-------------+-----------------+-------------
 12408 | postgres | 22259 |       10 | 2017-04-22 08:50:12.063936-04 | 2017-04-22 08:50:12.070745-04 | telegraf         | ::1         |                 |       35278
(1 row)

@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch 2 times, most recently from 97b2b34 to 2396964 Compare April 22, 2017 13:01

## maxlifetime - specify the maximum lifetime of a connection.
## default is forever (0s)
max_lifetime = "0s"
Contributor

What is the point of this feature?

Contributor Author

It's to allow tuning of the maximum amount of time before releasing a connection and creating a new one. I included it as an escape hatch in case there is something funky in someone's setup.
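For context, these options presumably map onto the standard connection-pool knobs on Go's sql.DB; a minimal sketch under that assumption (the helper name and the lib/pq driver choice are purely illustrative, not this PR's actual code):

package postgresql

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // illustrative driver registration; the plugin itself uses pgx
)

// openPool is a hypothetical helper showing how max_open, max_idle and
// max_lifetime would presumably be applied to the pool.
func openPool(address string, maxOpen, maxIdle int, maxLifetime time.Duration) (*sql.DB, error) {
	db, err := sql.Open("postgres", address)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(maxOpen)        // 1 by default in this PR
	db.SetMaxIdleConns(maxIdle)        // 0 forces a disconnect after each use
	db.SetConnMaxLifetime(maxLifetime) // 0 means connections are reused forever
	return db, nil
}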

Contributor

Meh. I personally would favor leaving it out. The only reason it was added to the golang sql lib was to handle an issue with MySQL databases. I can't see any reason why it would need to be here.
But the argument is just to keep out extra complexity where unneeded. Not a major issue.


## maxidle - specify the number of idle connections to maintain. see sql.DB for details.
## default is 1. setting to 0 to force disconnect after use.
max_idle = 1
Contributor

I don't think this feature should be added. Because of how the plugin works, after the polling interval is complete, every single connection is going to be idle. Thus it's going to result in connections constantly being created and destroyed, defeating the purpose of the persistence.

Contributor Author

every single connection is going to be idle.

Of which we only have one, because of how the gather step is laid out. It only ever creates a single connection in its current form.

every single connection is going to be idle.

Yup, that single connection is going to be idle, which is fine because the default idle allowed is 1 with a forever lifetime.

Thus it's going to result in connections constantly being created and destroyed

flat out incorrect.

Contributor Author

@james-lawrence james-lawrence Apr 22, 2017

Thus it's going to result in connections constantly being created and destroyed

Unless you mean when it's set to 0? Because that is true. I will adjust the documentation to not suggest that.

Contributor

flat out incorrect.

Yes, because of other problems (such as setting max_conns > 1 being pointless). As soon as they are fixed, as long as max_idle < max_conns the statement is completely accurate.

There is however one use case I didn't think of. Setting to 0 would allow restoring the previous behavior of tearing down the connection after every polling interval. Though care would need to be taken to ensure it's not torn down after every query.

}

// Start starts the ServiceInput's service, whatever that may be
func (p *Service) Start(telegraf.Accumulator) (err error) {
Contributor

Why was the plugin converted to a service?

Contributor Author

To hook into the Start/Stop functionality to maintain the connection. Could also be done with a sync.Once, but meh.
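For comparison, a minimal sketch of the sync.Once alternative mentioned above, opening the pool lazily on the first Gather instead of in Start (purely illustrative; the field names and the lib/pq driver are assumptions, not this PR's code):

package postgresql

import (
	"database/sql"
	"sync"

	"github.com/influxdata/telegraf"
	_ "github.com/lib/pq" // illustrative driver registration
)

type Postgresql struct {
	Address string

	db   *sql.DB
	once sync.Once
	err  error
}

// Gather opens the pool exactly once and reuses it on every later call,
// which is roughly what the Start/Stop hooks achieve in this PR.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	p.once.Do(func() {
		p.db, p.err = sql.Open("postgres", p.Address)
	})
	if p.err != nil {
		return p.err
	}
	// ... run the configured queries against p.db and add the results to acc
	return nil
}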


## maxopen - specify the maximum number of connections to maintain. see sql.DB for details.
## default is 1.
max_open = 1
Contributor

This setting doesn't currently serve a purpose. Both the postgresql and postgresql_extensible plugins only execute one query at a time. Thus they'll never open more than 1 connection.

Contributor Author

Yup, hence why it's set to 1 by default. There is also nothing preventing us from allowing more than 1 in the future.

Contributor Author

Also, I am unsure of telegraf's collection behaviour if an input takes longer than the collection period. Does it invoke Gather again? If it does, then this setting has a potential purpose. Otherwise I am in favor of leaving it out of the documentation.

Contributor

If gather takes too long, it spits out an error and does nothing. It will not launch a second gather. Part of the plans I had for this was to parallelize the queries. It would also be beneficial to share the pool among multiple instances of the plugin. Then it would serve a purpose.

@@ -42,6 +40,20 @@ var sampleConfig = `
##
address = "host=localhost user=postgres sslmode=disable"

## connection configuration.

## maxidle - specify the number of idle connections to maintain. see sql.DB for details.
Contributor

Users of telegraf aren't necessarily going to know what sql.DB is, or where to see it.

Contributor Author

@james-lawrence james-lawrence Apr 22, 2017

I think the first sentence is a sufficient description; I simply included the comment for further reading.

Are you implying we need to document it more? Or that sql.DB shouldn't be referenced?

Contributor

Either/or. Just saying that, as it is, it can confuse people. I have no current opinion on how it's addressed.

@@ -168,22 +168,22 @@ func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
sql_query += query_addon

if p.Query[i].Version <= db_version {
rows, err := db.Query(sql_query)
rows, err := p.DB.Query(sql_query)
Contributor

We should put in a retry handler. Otherwise if the connection gets killed, our query will fail. And if the server has something like an idle connection killer that's shorter than telegraf's gather interval, every single gather will fail.

Contributor Author

This is generally handled by the sql.DB, no? It shouldn't continue to use a bad connection. There was a bug regarding this ages ago, IIRC: golang/go#3777

Either way, that problem is solved by setting the maxlifetime value (see your question above on why it should be included).

Contributor Author

@james-lawrence james-lawrence Apr 22, 2017

At most I'd put a db.Ping() at the start; it will cause it to establish a working connection.
See https://golang.org/src/database/sql/sql.go?s=7997:9207#L588
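A minimal sketch of where such a ping would presumably sit (hypothetical placement and struct shape, not this PR's final code):

package postgresql

import (
	"database/sql"

	"github.com/influxdata/telegraf"
)

type Postgresql struct {
	DB *sql.DB
}

// Gather pings first so that a dead cached connection is discarded and
// replaced before the metric queries run, at the cost of one extra round trip.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	if err := p.DB.Ping(); err != nil {
		return err
	}
	// ... execute the configured queries against p.DB
	return nil
}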

Contributor

No, this isn't handled by the sql package.

I don't see how maxlifetime has any bearing on this problem.

db.Ping() is also not a good way to solve it. First you'd have to start a transaction to ensure that the query uses the same connection the ping just used. Then the ping would also just introduce extra latency as you're adding an additional query. Basically this is not what Ping() is for. There are several discussions on the matter, such as this one and this one.

Contributor Author

@james-lawrence james-lawrence Apr 24, 2017

maxlifetime solves the issue (of a connection being terminated by the db due to idle time) by forcing the driver to not reuse a connection after it's been open longer than the maxlifetime.

First you'd have to start a transaction to ensure that the query uses the same connection the ping just used

We've already established that these plugins have only ever used one connection (this PR makes no changes to that fact). A transaction isn't necessary.

You've basically complained that

  1. The connection may have been killed by an internal db process. Fine, ping will actually force it to reconnect by triggering that extra query, and it has a retry built in. This prevents the first metric query from failing in that particular situation, which could also be solved by setting maxlifetime to a value lower than the idle connection timeout.
  2. Now you're complaining about the latency of adding the ping.

So either 1) accept that manually setting the max lifetime to a lower value is a solution to that, or 2) accept the added latency of the ping as built-in protection against the issue.

I'm not interested in doing retry logic for a problem for which this PR already has 2 built-in solutions. Yes, a bunch of shit can go wrong at any time, none of which was being handled prior to this PR anyway. Feel free to write your own PR implementing whatever crazy retry logic you want, but quite frankly, if your postgresql setup needs that kind of logic you probably have bigger issues.

Regarding the examples:

  1. Is referring to a successful ping after a failed query, which makes perfect sense because if you look at the sql.DB code it discards a connection when a broken error is detected.

  2. Actually discusses the exact use case it's being put to here. If something goes wrong between the ping and the first query, that is literally no different than something going wrong between any of the queries, and it is outside the scope of this PR.

Contributor

maxlifetime solves the issue (of a connection being terminated by the db due to idle time) by forcing the driver to not reuse a connection after it's been open longer than the maxlifetime.

No it doesn't. For it to accomplish this goal, you would have to set the maxlifetime value to less than the gather interval, which means that every gather it would be getting a new connection, defeating the purpose.

I'm not interested in doing retry logic for a problem for which this PR already has 2 built-in solutions. Yes, a bunch of shit can go wrong at any time, none of which was being handled prior to this PR anyway.

No, this PR is introducing new problems. I am trying to raise attention to those problems so that we don't introduce new bugs, and have people reporting issues because things aren't working right.

Feel free to write your own PR implementing whatever crazy retry logic you want, but quite frankly, if your postgresql setup needs that kind of logic you probably have bigger issues.

An idle timer and network interruption is not crazy. DBs have idle timers, network firewalls have idle timers, people have poor connections, servers have maintenance, any number of things can cause the connection to be interrupted. Retry logic is something that has to be added to any application communicating with another app.


In regards to the ping, I will quote specific things from those links:

go-sql-driver/mysql#82 (comment)

Don't ping nor check the db connection before executing queries, it's racey.

go-sql-driver/mysql#82 (comment)

The key is to understand that a DB isn't a connection. When you sql.Open()
you get a DB. It manages a pool of connections in the background, and
doesn't open any until you need them. Doing a Ping() is a fine way to do an
initial check if you want, but it's not all that Go-ish.

go-sql-driver/mysql#82 (comment)

db.Ping() was introduced to give you an option to check the DSN since sql.Open(..) doesn't really open a connection: https://code.google.com/p/go/issues/detail?id=4804
Moreover it can be used to check if the server is still alive. So essentially @vadimtk is right. In many cases you "open" the DB right after you application starts and want to make sure you connection data is right. That's where db.Ping() is useful for.

golang/go#7619 (comment)

what nil error from Ping() means is that "there
is at least one connection to the database which we don't yet know to be bad, or a new
connection was successfully established". I don't see where this information would be
valuable except perhaps right after sql.Open().

@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch 4 times, most recently from 8f46c9f to ee2b082 Compare April 22, 2017 21:40
@james-lawrence
Contributor Author

I've removed the documentation around max_open/max_idle, since I agree it could have been confusing, and added a ping check at the start to make sure a working connection is available.

@danielnelson danielnelson added this to the 1.4.0 milestone Apr 24, 2017
@danielnelson
Contributor

Just briefly looked over this, will include in 1.4. Let me preface this by saying I haven't used the sql package before. Doesn't the sql package reopen a new connection automatically if one is closed cleanly? What cases would we need to retry or wait until the next Gather loop?

@phemmer
Contributor

phemmer commented Apr 24, 2017

Just briefly looked over this, will include in 1.4. Let me preface this by saying I haven't used the sql package before. Doesn't the sql package reopen a new connection automatically if one is closed cleanly? What cases would we need to retry or wait until the next Gather loop?

It does not. It passes the error to the client and lets the client handle it. This is because the connection might have been closed for a number of reasons that the client might need to handle, or log (so that admins are aware their connections are unexpectedly closing).

Pretty much any time we get a connection error we would need to immediately retry. Waiting until the next gather would not be a good idea for a few reasons.

  1. If the connection was killed due to something like an idle timeout. Every other gather interval would fail. First gather runs, the pool is empty and gets a new connection, runs fine, and puts the connection in the pool. Some idle timer somewhere, which is shorter than gather interval kills the connection. Second gather runs, gets the connection from the pool, connection is dead, gather fails. Rinse & repeat.
  2. The gather interval might be really infrequent. One such reason is if the query is really expensive and you don't want to run it very often. So if the gather interval is 15 minutes, that means you have a half hour between successful gathers.
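For illustration only, the kind of single immediate retry being argued for here might look like the following sketch (the helper name is hypothetical and this is not code from the PR):

package postgresql

import "database/sql"

// queryWithRetry runs the query once and, if it fails, retries exactly once.
// The idea is that a query failing because a pooled connection was silently
// killed will usually succeed on the second attempt with a fresh connection,
// while persistent failures (db down, bad SQL) still surface as errors.
func queryWithRetry(db *sql.DB, query string) (*sql.Rows, error) {
	rows, err := db.Query(query)
	if err == nil {
		return rows, nil
	}
	return db.Query(query)
}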

@danielnelson
Contributor

danielnelson commented Apr 24, 2017

Well that makes things tricky, wish it were more like http keepalive. Now I'm being really lazy, but does Ping() make a network call or just toss closed connections?

@phemmer
Contributor

phemmer commented Apr 24, 2017

Well that makes things tricky, wish it were more like http keepalive. Now I'm being really lazy, but does Ping() make a network call or just toss closed connections?

Network call.

@danielnelson
Contributor

So in that case we shouldn't use it to ensure a connection.

@phemmer
Contributor

phemmer commented Apr 24, 2017

You couldn't even if you wanted to. Aside from issues like it introduces latency in the operation, and the remote chance that something can happen between ping and query, the big issue is that there's no guarantee that the ping will happen on the same connection as the query.
The ping will check a connection out of the pool, do the ping, and put it back. Your query will check a connection out of the pool, do the query, and put it back. There's nothing that says your query will get the same connection as the ping.
The only way to ensure that 2 operations happen on the same connection is to use a transaction. And that just brings new problems you have to be aware of.
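To make that point concrete, a small sketch of the transaction approach being described, reusing the pg_backend_pid query that appears elsewhere in this thread (illustrative only; the thread's conclusion is that this is not worth doing here):

package postgresql

import "database/sql"

// pidOnOneConn shows how a transaction pins all of its statements to a single
// underlying connection, so the liveness check and the real query are
// guaranteed to share it. Everything here uses only database/sql.
func pidOnOneConn(db *sql.DB) (int64, error) {
	tx, err := db.Begin()
	if err != nil {
		return 0, err
	}
	defer tx.Rollback() // a no-op after a successful Commit

	// Cheap liveness check on the pinned connection.
	if _, err := tx.Exec("SELECT 1"); err != nil {
		return 0, err
	}
	var pid int64
	if err := tx.QueryRow("select pg_backend_pid()").Scan(&pid); err != nil {
		return 0, err
	}
	return pid, tx.Commit()
}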

@james-lawrence
Contributor Author

james-lawrence commented Apr 25, 2017

Doesn't the sql package reopen a new connection automatically if one is closed cleanly? What cases would we need to retry or wait until the next Gather loop?

To answer your question: yes, on a clean close it'll happily create a new connection. On a bad close that complicates matters, which the ping will resolve by making the network call (if we actually care; I don't, I only added it because phemmer seemed concerned about things that in reality are not a huge deal or are easily solved by maxlifetime/TCP keepalives). A lot of phemmer's comments are, in my opinion, making mountains out of molehills.

Here is the code for getting a connection:

// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
	db.mu.Lock()
	if db.closed {
		db.mu.Unlock()
		return nil, errDBClosed
	}
	// Check if the context is expired.
	select {
	default:
	case <-ctx.Done():
		db.mu.Unlock()
		return nil, ctx.Err()
	}
	lifetime := db.maxLifetime

	// Prefer a free connection, if possible.
	numFree := len(db.freeConn)
	if strategy == cachedOrNewConn && numFree > 0 {
		conn := db.freeConn[0]
		copy(db.freeConn, db.freeConn[1:])
		db.freeConn = db.freeConn[:numFree-1]
		conn.inUse = true
		db.mu.Unlock()
		if conn.expired(lifetime) {
			conn.Close()
			return nil, driver.ErrBadConn
		}
		return conn, nil
	}

	// Out of free connections or we were asked not to use one. If we're not
	// allowed to open any more connections, make a request and wait.
	if db.maxOpen > 0 && db.numOpen >= db.maxOpen {
		// Make the connRequest channel. It's buffered so that the
		// connectionOpener doesn't block while waiting for the req to be read.
		req := make(chan connRequest, 1)
		reqKey := db.nextRequestKeyLocked()
		db.connRequests[reqKey] = req
		db.mu.Unlock()

		// Timeout the connection request with the context.
		select {
		case <-ctx.Done():
			// Remove the connection request and ensure no value has been sent
			// on it after removing.
			db.mu.Lock()
			delete(db.connRequests, reqKey)
			db.mu.Unlock()
			select {
			default:
			case ret, ok := <-req:
				if ok {
					db.putConn(ret.conn, ret.err)
				}
			}
			return nil, ctx.Err()
		case ret, ok := <-req:
			if !ok {
				return nil, errDBClosed
			}
			if ret.err == nil && ret.conn.expired(lifetime) {
				ret.conn.Close()
				return nil, driver.ErrBadConn
			}
			return ret.conn, ret.err
		}
	}

	db.numOpen++ // optimistically
	db.mu.Unlock()
	ci, err := db.driver.Open(db.dsn)
	if err != nil {
		db.mu.Lock()
		db.numOpen-- // correct for earlier optimism
		db.maybeOpenNewConnections()
		db.mu.Unlock()
		return nil, err
	}
	db.mu.Lock()
	dc := &driverConn{
		db:        db,
		createdAt: nowFunc(),
		ci:        ci,
	}
	db.addDepLocked(dc, dc)
	dc.inUse = true
	db.mu.Unlock()
	return dc, nil
}

Okay, so let's go over the changes (ignoring ping; I don't really care for that in general anyway).

This PR changes the code from:

func (p Postgres) Gather(...) {
	db, err := sql.Open(...)
	defer db.Close()
	// ... execute queries
}

to:

func (p Postgres) Start(...) {
	p.db, err = sql.Open(...)
	// ...
}

func (p Postgres) Gather(...) {
	// ... execute queries
}

func (p Postgres) Shutdown() {
	p.db.Close()
}

When maxlifetime is set to less than the collection interval these two are functionally equivalent; if a system runs into an issue, they'd run into it in either one. All this bullshit about idle connections being closed by infrastructure is resolved by TCP keepalives (which default to 10 minutes in the driver currently being used), except for idle connection shutdowns coming directly from postgresql (this is the original reason for invoking Ping, as it would likely detect this). Those are configuration issues on the end user's side that we can do nothing about, but again they are simply resolved by setting the maxlifetime to an appropriate value if necessary. However, given that I believe each of the methods on DB has a built-in retry with a fallback to a new connection, I'm fairly confident it won't need any tuning; see the code snippet below for QueryContext.

If there is some idle connection termination in place on the telegraf user's system, it is their responsibility to configure it correctly. These are really not valid reasons to put retry handling logic here; they are better handled in the drivers.

We are talking about a worst-case scenario where, in order for an end user to get the previous behaviour from these changes, they have to configure maxlifetime to less than the collection interval. On the plus side this will generally allow for longer-lasting connections to the DB, due to the TCP keepalives by the driver and the fact that we are not forcibly reconnecting every time unless it is configured to do so.

If the connection was killed due to something like an idle timeout. Every other gather interval would fail. First gather runs, the pool is empty and gets a new connection, runs fine, and puts the connection in the pool. Some idle timer somewhere, which is shorter than gather interval kills the connection. Second gather runs, gets the connection from the pool, connection is dead, gather fails. Rinse & repeat.

The code begs to differ:

func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

// putConn adds a connection to the db's free pool.
// err is optionally the last error that occurred on this connection.
func (db *DB) putConn(dc *driverConn, err error) {
	db.mu.Lock()
	if !dc.inUse {
		if debugGetPut {
			fmt.Printf("putConn(%v) DUPLICATE was: %s\n\nPREVIOUS was: %s", dc, stack(), db.lastPut[dc])
		}
		panic("sql: connection returned that was never out")
	}
	if debugGetPut {
		db.lastPut[dc] = stack()
	}
	dc.inUse = false

	for _, fn := range dc.onPut {
		fn()
	}
	dc.onPut = nil

	if err == driver.ErrBadConn {
		// Don't reuse bad connections.
		// Since the conn is considered bad and is being discarded, treat it
		// as closed. Don't decrement the open count here, finalClose will
		// take care of that.
		db.maybeOpenNewConnections()
		db.mu.Unlock()
		dc.Close()
		return
	}
	if putConnHook != nil {
		putConnHook(db, dc)
	}
	added := db.putConnDBLocked(dc, nil)
	db.mu.Unlock()

	if !added {
		dc.Close()
	}
}

@danielnelson
Contributor

So long as we don't skip a gather because postgres closed an idle connection, we are good.

@james-lawrence
Contributor Author

I'll run some tests this weekend to verify idle connection termination won't impact the gather stage.

@phemmer
Contributor

phemmer commented Apr 26, 2017

To answer your question: yes, on a clean close it'll happily create a new connection. On a bad close that complicates matters.

No it will not. The only "clean close" is if the client closes the connection, in which case it will not be in the pool, and is irrelevant to the issue at hand.

Obviously I'm unable to convince you, so here's some proof. I set up our firewall to drop idle connections after 15 seconds. I then configured telegraf to gather at 30 second intervals:

[agent]
  flush_interval = "1s"

[[outputs.file]]
  files = ["stdout"]
  tagexclude = ["host"]

[[inputs.postgresql_extensible]]
  interval = "30s"
  address = "host=172.28.128.7 sslmode=disable user=postgres dbname=postgres"
  tagexclude = ["db","server"]
  [[inputs.postgresql_extensible.query]]
    sqlquery = "select pg_backend_pid() as pid"

Here's the result of running telegraf with your PR:

# ./telegraf -config telegraf.conf
2017-04-26T03:56:25Z I! Starting Telegraf (version )
2017-04-26T03:56:25Z I! Loaded outputs: file
2017-04-26T03:56:25Z I! Loaded inputs: inputs.postgresql_extensible
2017-04-26T03:56:25Z I! Tags enabled: host=fll1000539l.local
2017-04-26T03:56:25Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"fll1000539l.local", Flush Interval:1s 
postgresql pid=27953i 1493178990000000000
2017-04-26T03:57:00Z E! Error in plugin [inputs.postgresql_extensible]: read tcp 192.168.102.79:65339->172.28.128.7:5432: read: connection reset by peer
postgresql pid=28002i 1493179050000000000
2017-04-26T03:58:00Z E! Error in plugin [inputs.postgresql_extensible]: read tcp 192.168.102.79:65367->172.28.128.7:5432: read: connection reset by peer
postgresql pid=28056i 1493179110000000000
2017-04-26T03:59:00Z E! Error in plugin [inputs.postgresql_extensible]: read tcp 192.168.102.79:65396->172.28.128.7:5432: read: connection reset by peer

Notice how it's behaving EXACTLY as I predicted it would. Every other gather fails.

Please don't just dismiss my comments out of hand. I'm quite familiar with go & the sql library.

I'm happy to adjust my telegraf config with whatever settings you would like me to test, but I guarantee, nothing will fix it until a retry is put in the code.

@james-lawrence
Contributor Author

james-lawrence commented Apr 26, 2017

@phemmer: and you didn't do the one thing I said would restore this code to the current behaviour in master: set max_lifetime to be less than your collection period.

Here is an example:
telegraf-lifetime-misconfigured.conf

[agent]
	flush_interval = "1s"
	
[[outputs.file]]
	files = ["stdout"]
	tagexclude = ["host"]

[[inputs.postgresql_extensible]]
	interval = "10s"
	# max_lifetime = "9s" this line is commented out to cause the error to occur as in your example.
	address = "host=127.0.0.1 user=postgres sslmode=disable application_name=telegraf port=5432"
	databases = ["postgres"]
	[[inputs.postgresql_extensible.query]]
		sqlquery = "select pg_backend_pid() as pid"
tcpkill -i lo port 5432
./bin/telegraf --config telegraf-lifetime-misconfigured.conf
2017-04-26T22:56:17Z I! Starting Telegraf (version dev-148-gee2b0828)
2017-04-26T22:56:17Z I! Loaded outputs: file
2017-04-26T22:56:17Z I! Loaded inputs: inputs.postgresql_extensible
2017-04-26T22:56:17Z I! Tags enabled: host=jambli
2017-04-26T22:56:17Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"jambli", Flush Interval:1s 
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4232i 1493247380000000000
2017-04-26T22:56:30Z E! Error in plugin [inputs.postgresql_extensible]: write tcp 127.0.0.1:46760->127.0.0.1:5432: write: connection reset by peer
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4238i 1493247400000000000
2017-04-26T22:56:50Z E! Error in plugin [inputs.postgresql_extensible]: write tcp 127.0.0.1:46802->127.0.0.1:5432: write: connection reset by peer
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4243i 1493247420000000000
2017-04-26T22:57:10Z E! Error in plugin [inputs.postgresql_extensible]: write tcp 127.0.0.1:46844->127.0.0.1:5432: write: connection reset by peer
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4250i 1493247440000000000
2017-04-26T22:57:30Z E! Error in plugin [inputs.postgresql_extensible]: write tcp 127.0.0.1:46884->127.0.0.1:5432: write: connection reset by peer
2017-04-26T22:57:30Z I! Hang on, flushing any cached metrics before shutdown

Notice it reproduces your example exactly. Now watch what happens if I add in the max_lifetime (which causes the plugin to act the same as the code in master):
telegraf-lifetime-correctly-configured.conf

[agent]
	flush_interval = "1s"
	
[[outputs.file]]
	files = ["stdout"]
	tagexclude = ["host"]

[[inputs.postgresql_extensible]]
	interval = "10s"
	max_lifetime = "9s"
	address = "host=127.0.0.1 user=postgres sslmode=disable application_name=telegraf port=5432"
	databases = ["postgres"]
	[[inputs.postgresql_extensible.query]]
		sqlquery = "select pg_backend_pid() as pid"
tcpkill -i lo port 5432
./bin/telegraf --config telegraf-lifetime-correctly-configured.conf
2017-04-26T23:05:42Z I! Starting Telegraf (version dev-148-gee2b0828)
2017-04-26T23:05:42Z I! Loaded outputs: file
2017-04-26T23:05:42Z I! Loaded inputs: inputs.postgresql_extensible
2017-04-26T23:05:42Z I! Tags enabled: host=jambli
2017-04-26T23:05:42Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"jambli", Flush Interval:1s 
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4635i 1493247950000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4638i 1493247960000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4642i 1493247970000000000
postgresql,db=postgres,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432 pid=4647i 1493247980000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4650i 1493247990000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4653i 1493248000000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4656i 1493248010000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4668i 1493248020000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4672i 1493248030000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4676i 1493248040000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4680i 1493248050000000000
postgresql,server=host\=127.0.0.1\ user\=postgres\ application_name\=telegraf\ port\=5432,db=postgres pid=4684i 1493248060000000000
2017-04-26T23:07:42Z I! Hang on, flushing any cached metrics before shutdown

Notice how it's not erroring anymore.

Long story short: at worst you'll get no benefits from this PR and you'll just need to adjust your configuration; you would have needed to adjust your configuration regardless if your firewall has a short termination period. But as I've said:

  • Retries are not the solution.
  • There is an easy way to revert the behaviour if your network has an idle connection killer set to less than 5 minutes.

I'm happy to do a follow-up PR once this is in to expose the TCP keepalive setting, so you can then get the full benefits of a persistent connection.
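For reference, a rough sketch of the socket-level knob being referred to, assuming the driver can be handed a custom dialer (hypothetical; the actual follow-up PR may expose it differently):

package postgresql

import (
	"net"
	"time"
)

// keepaliveDialer returns a dialer whose TCP keepalive period can be tuned.
// As long as the keepalive interval is shorter than whatever idle timer sits
// between telegraf and postgres, an otherwise idle connection stays alive.
func keepaliveDialer(period time.Duration) *net.Dialer {
	return &net.Dialer{KeepAlive: period}
}

A driver that accepts a custom dial function could then open its connections through keepaliveDialer(30 * time.Second).Dial.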

@james-lawrence
Contributor Author

james-lawrence commented Apr 26, 2017

Please don't just dismiss my comments out of hand. I'm quite familiar with go & the sql library.

I never did; I went and double- and triple-checked database/sql and the pgx driver for possible issues.

The fact of the matter is, if TCP keepalives or max lifetime didn't solve the issue, there would be something seriously wrong in either the driver or the sql package.

@phemmer
Contributor

phemmer commented Apr 26, 2017

Why should I have to add a config param to fix an issue this PR introduces? Changes should be non-breaking.
What if I'm a sysadmin with a working config, and all of a sudden our network team decides to put an idle timer on the firewalls? Telegraf is going to start breaking until I notice it and push out a config fix.

If max_lifetime is your only proposed way of restoring the previous behavior (non-persistence), this basically wastes a connection slot on the server, as the max lifetime is enforced when the connection is checked out from the pool.

retries are not the solution.

Why such an aversion to retries? A retry would fix the issue without any action from the user. It would also handle a plethora of other issues, such as random network interruption, db going down for maintenance, etc.

@danielnelson
Contributor

I imagine it is possible for the DB to configure the idle timeout on connections, perhaps via pgbouncer? I feel like the client shouldn't show any error if an idle connection is closed by the server. If it is set less than interval that would happen each gather, is this correct?

(My premise that idle connections should be allowed to be closed without effect is mostly based on my experience with HTTP keep-alive.)

@danielnelson
Contributor

Also, if we do retries it should really just be a retry. No exponential backoffs or anything :)

@james-lawrence
Contributor Author

james-lawrence commented Apr 27, 2017

Why should I have to add a config param to fix an issue this PR introduces?

  • Because it's not an issue or a bug, as I've shown (it's only an issue to you).
  • Because it will be clearly documented in the changelist.
  • Because telegraf has plenty of examples of changes between versions that require configuration changes.
  • Because it's unrealistic to expect seamless upgrades between versions.
  • Because you only need to make the configuration change if your network has an aggressive idle TCP connection killing mechanism, i.e. it kills connections idling for less than 5 minutes.

If max_lifetime is your only proposed way of restoring the previous behavior (non-persistence), this basically wastes a connection slot on the server.

Yup, but in reality if you actually care about always being able to collect metrics you're going to reserve the slot anyway, so this is a non-issue. If you don't care about always being able to collect metrics, then why are you continuously pushing on retries? =)

Why such an aversion to retries? A retry would fix the issue without any action from the user.

Because they make the code harder to maintain, it's more code that needs to be debugged, they tend to burn CPU cycles/time attempting code that will never succeed except in extremely rare cases (like random and very short network interruptions), and most importantly they are just not needed in this case.

random network interruption

Yup, but current master is also impacted by these, so nothing new would be introduced; so don't apply it to this PR. And even if it were a separate PR, I'd argue they have a negative impact on actually collecting metrics each cycle. It also only helps with extremely short network interruptions, on the order of milliseconds to seconds.

db going down for maintenance

If the db goes down for maintenance you can't collect metrics anyway, and guess what? Now you just caused the gather to take much longer to complete (and likely still fail by the end anyway). Again, this is a non-issue.

I'm really tired of this. =)

@james-lawrence
Contributor Author

james-lawrence commented Apr 27, 2017

I imagine it is possible for the DB to configure the idle timeout on connections, perhaps via pgbouncer?

pgbouncer does offer this, client_idle_timeout; they obviously mark it as dangerous. =) I also tried using pgbouncer to reproduce this and it didn't work, so I had to switch to tcpkill.

I feel like the client shouldn't show any error if an idle connection is closed by the server.

If it's closed properly by the server it shouldn't. If it's hard-killed it might, but that would only happen in a situation similar to the firewall demonstration above. It shouldn't happen once gather actually starts.

If it is set less than interval that would happen each gather, is this correct?

If by 'it' you mean max_lifetime, and by 'that' you mean creating a new connection, then yes. =)

@danielnelson
Contributor

If by 'it' you mean max_lifetime, and by 'that' you mean creating a new connection, then yes. =)

What about if client_idle_timeout on pgbouncer is less than interval?

@james-lawrence
Contributor Author

What about if client_idle_timeout on pgbouncer is less than interval?

That is effectively the same as my examples using tcpkill above. But like I said, I couldn't get pgbouncer to terminate my idle connections within a 10s interval; I set client_idle_timeout to 1s and it kept my client connection around between cycles.

@danielnelson
Contributor

I should also mention the docs for this setting do encourage setting it properly:

Client connections idling longer than this many seconds are closed. This should be larger than the client-side connection lifetime settings, and only used for network problems.

@danielnelson
Contributor

If the connection is closed cleanly it should not affect subsequent gathers. If the firewall switches to DROP or someone digs up the cable with a backhoe then I'm okay with missing the next gather.

@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch 3 times, most recently from 50ec85d to 0d4491c Compare July 28, 2017 17:36
@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch from 0d4491c to 5fb81b5 Compare August 5, 2017 09:06
@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 14, 2017
@danielnelson danielnelson added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Aug 24, 2017
@james-lawrence
Contributor Author

I'll fix this up again this weekend.

@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch 5 times, most recently from 9a582a1 to ddd17d2 Compare September 24, 2017 12:49
@james-lawrence
Contributor Author

Just an aside: the connection termination issue that was a hot topic in this PR can be addressed in a future PR. I managed to get the necessary changes into database/sql to allow drivers to be configured more easily with connection details, such as a keepalive configuration. Just waiting for it to land in go1.10; I'll make a PR to the driver and telegraf once that happens.
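(The Go 1.10 change being referred to is presumably the new driver.Connector interface and sql.OpenDB, which let a pool be built from structured driver configuration instead of a DSN string; a minimal sketch of that shape, not tied to any particular driver:)

package postgresql

import (
	"context"
	"database/sql"
	"database/sql/driver"
)

// dsnConnector is a hypothetical, minimal driver.Connector: a real driver
// would carry richer settings here (dialer, keepalives, TLS, ...).
type dsnConnector struct {
	drv driver.Driver
	dsn string
}

func (c dsnConnector) Connect(ctx context.Context) (driver.Conn, error) {
	return c.drv.Open(c.dsn)
}

func (c dsnConnector) Driver() driver.Driver { return c.drv }

// openDB builds the pool directly from the connector (Go 1.10+).
func openDB(drv driver.Driver, dsn string) *sql.DB {
	return sql.OpenDB(dsnConnector{drv: drv, dsn: dsn})
}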

@james-lawrence james-lawrence force-pushed the postgresql-connection-configuration branch from ddd17d2 to 5eb57c0 Compare September 25, 2017 19:21
@danielnelson danielnelson modified the milestones: 1.5.0, 1.6.0 Nov 30, 2017
@danielnelson
Contributor

@james-lawrence Sorry to leave this one sitting for so long, is this ready for review?

@james-lawrence
Contributor Author

@danielnelson it's all good; yes, it's ready as far as I am concerned. There are some further improvements I can make once I finish getting changes upstream.

@danielnelson danielnelson merged commit 1d86064 into influxdata:master Jan 6, 2018