make sql.DB connection details configurable #2701
Force-pushed from 97b2b34 to 2396964
```toml
## maxlifetime - specify the maximum lifetime of a connection.
## default is forever (0s)
max_lifetime = "0s"
```
What is the point of this feature?
It's to allow tuning of the maximum amount of time before a connection is released and a new one created. I included it as an escape hatch in case there's something funky in someone's setup.
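For the record, the knob maps directly onto database/sql; a minimal sketch (helper name hypothetical, driver assumed to be pgx's stdlib shim, which registers as "pgx"):

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/stdlib" // assumed driver; registers as "pgx"
)

// openDB is a hypothetical helper showing what max_lifetime maps to.
func openDB(address string, maxLifetime time.Duration) (*sql.DB, error) {
	db, err := sql.Open("pgx", address)
	if err != nil {
		return nil, err
	}
	// 0 keeps connections forever; a positive value makes the pool discard
	// any connection that has been open longer, instead of reusing it.
	db.SetConnMaxLifetime(maxLifetime)
	return db, nil
}
```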
Meh. I personally would favor leaving it out. The only reason it was added to the golang sql lib was to handle an issue with MySQL databases. I can't see any reason why it would need to be here.
But the argument is just to keep out extra complexity where unneeded. Not a major issue.
```toml
## maxidle - specify the number of idle connections to maintain. see sql.DB for details.
## default is 1. setting to 0 to force disconnect after use.
max_idle = 1
```
I don't think this feature should be added. Because of how the plugin works, after the polling interval is complete, every single connection is going to be idle. Thus it's going to result in connections constantly being created and destroyed, defeating the purpose of the persistence.
> every single connection is going to be idle.

Of which we only have one, because of how the gather step is laid out; it only ever creates a single connection in its current form.

> every single connection is going to be idle.

Yup, that single connection is going to be idle, which is fine, because the default idle allowed is 1 with a forever lifetime.

> Thus it's going to result in connections constantly being created and destroyed

Flat out incorrect.
> Thus it's going to result in connections constantly being created and destroyed

Unless you mean when it's set to 0? Because that is true. I will adjust the documentation to not suggest that.
> flat out incorrect.

Yes, because of other problems (such as setting `max_conns > 1` being pointless). As soon as they are fixed, as long as `max_idle < max_conns`, the statement is completely accurate.
There is however one use case I didn't think of. Setting to 0 would allow restoring the previous behavior of tearing down the connection after every polling interval. Though care would need to be taken to ensure it's not torn down after every query.
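For reference, a sketch of what the two interesting max_idle values do (helper name hypothetical; db is the plugin's shared pool):

```go
// configureIdle is a hypothetical helper showing what max_idle maps to.
func configureIdle(db *sql.DB, maxIdle int) {
	// max_idle = 1 (the default here): the single connection the gather
	// loop uses stays warm between intervals.
	// max_idle = 0: connections are closed as soon as they go idle,
	// restoring the old tear-down-after-every-interval behaviour.
	db.SetMaxIdleConns(maxIdle)
}
```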
plugins/inputs/postgresql/service.go (outdated)
```go
}

// Start starts the ServiceInput's service, whatever that may be
func (p *Service) Start(telegraf.Accumulator) (err error) {
```
Why was the plugin converted to a service?
To hook into the Start/Stop functionality to maintain the connection. Could also be done with a sync.Once, but meh.
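Roughly the shape being described, as a sketch (field and type names approximated, not the PR's exact code; assumes the telegraf and telegraf/internal packages are imported):

```go
// Service owns the shared *sql.DB so connections live across Gather calls.
type Service struct {
	Address     string
	MaxIdle     int
	MaxOpen     int
	MaxLifetime internal.Duration // telegraf's duration wrapper
	DB          *sql.DB
}

// Start opens the pool once and applies the connection settings;
// Gather then reuses p.DB until Stop is called.
func (p *Service) Start(telegraf.Accumulator) (err error) {
	if p.DB, err = sql.Open("pgx", p.Address); err != nil {
		return err
	}
	p.DB.SetMaxOpenConns(p.MaxOpen)
	p.DB.SetMaxIdleConns(p.MaxIdle)
	p.DB.SetConnMaxLifetime(p.MaxLifetime.Duration)
	return nil
}

// Stop closes the pool when telegraf shuts down or reloads.
func (p *Service) Stop() {
	if p.DB != nil {
		p.DB.Close()
	}
}
```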
```toml
## maxopen - specify the maximum number of connections to maintain. see sql.DB for details.
## default is 1.
max_open = 1
```
This setting doesn't currently serve a purpose. Both the `postgresql` and `postgresql_extensible` plugins only execute one query at a time, thus they'll never open more than 1 connection.
Yup, hence why it's set to 1 by default. There's also nothing preventing us from allowing more than 1 in the future.
Also, I'm unsure of telegraf's collection behaviour if an input takes longer than the collection period. Does it invoke Gather again? If it does, then this setting has a potential purpose; otherwise I'm in favor of leaving it out of the documentation.
If gather takes too long, it spits out an error and does nothing. It will not launch a second gather. Part of the plans I had for this was to parallelize the queries. It would also be beneficial to share the pool among multiple instances of the plugin. Then it would serve a purpose.
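If the queries are ever parallelized like that, max_open is what would bound the fan-out; a hypothetical sketch (not in this PR; assumes "database/sql" and "sync" are imported):

```go
// Hypothetical future use of max_open: run each configured query on its
// own pooled connection, bounded by SetMaxOpenConns.
func gatherParallel(db *sql.DB, queries []string) {
	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go func(q string) {
			defer wg.Done()
			rows, err := db.Query(q) // blocks while all max_open conns are busy
			if err != nil {
				return
			}
			defer rows.Close()
			// ... scan rows into the accumulator ...
		}(q)
	}
	wg.Wait()
}
```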
```diff
@@ -42,6 +40,20 @@ var sampleConfig = `
   ##
   address = "host=localhost user=postgres sslmode=disable"
+
+  ## connection configuration.
+
+  ## maxidle - specify the number of idle connections to maintain. see sql.DB for details.
```
Users of telegraf aren't necessarily going to know what `sql.DB` is, or where to see it.
I think the first sentence is a sufficient description; I simply included the comment for further reading.
Are you implying we need to document it more, or that sql.DB shouldn't be referenced?
Either or. Just saying that as it is, it can confuse people. I have no current opinion on how it's addressed.
```diff
@@ -168,22 +168,22 @@ func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
 		sql_query += query_addon

 		if p.Query[i].Version <= db_version {
-			rows, err := db.Query(sql_query)
+			rows, err := p.DB.Query(sql_query)
```
We should put in a retry handler. Otherwise if the connection gets killed, our query will fail. And if the server has something like an idle connection killer that's shorter than telegraf's gather interval, every single gather will fail.
This is generally handled by sql.DB, no? It shouldn't continue to use a bad connection. There was a bug regarding this ages ago, IIRC: golang/go#3777
Either way, that problem is solved by setting the maxlifetime value (see your above question on why it should be included).
At most I'd put a db.Ping() at the start; it will cause it to establish a working connection.
See https://golang.org/src/database/sql/sql.go?s=7997:9207#L588
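i.e. roughly this at the top of Gather (a sketch):

```go
// Sketch: a cheap liveness check before the metric queries. Ping verifies
// an existing pooled connection or establishes a new one if needed.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	if err := p.DB.Ping(); err != nil {
		return err
	}
	// ... run the metric queries against p.DB ...
	return nil
}
```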
No, this isn't handled by the sql package.

I don't see how `maxlifetime` has any bearing on this problem.

`db.Ping()` is also not a good way to solve it. First you'd have to start a transaction to ensure that the query uses the same connection the ping just used. Then the ping would also just introduce extra latency, as you're adding an additional query. Basically, this is not what `Ping()` is for. There are several discussions on the matter, such as this one and this one.
maxlifetime solves the issue (of a connection being terminated by the db due to idle time) by forcing the driver to not reuse a connection after it's been open for longer than the maxlifetime.

> First you'd have to start a transaction to ensure that the query uses the same connection the ping just used

We've already established that these plugins have only ever used one connection (this PR makes no changes to that fact). A transaction isn't necessary.

You've basically complained that:
- the connection may have been killed by an internal db process. Fine: ping actually will force it to reconnect by triggering that extra query, and has a retry built in. This prevents the first metric query from failing in that particular situation, which could also be solved by setting maxlifetime to a value lower than the idle connection timeout.
- now you're complaining about the latency of adding the ping.

So either 1) accept that manually setting the max lifetime to a lower value is a solution to that, or 2) accept the added latency of the ping as built-in protection against the issue.

I'm not interested in writing retry logic for an issue this PR already has 2 built-in solutions for. Yes, a bunch of shit can go wrong at any time, none of which was being handled prior to this PR anyway. Feel free to write your own PR implementing whatever crazy retry logic you want, but quite frankly, if your postgresql setup needs that kind of logic you probably have bigger issues.
Regarding the examples:
- The first is referring to a successful ping after a failed query, which makes perfect sense: if you look at the sql.DB code, it discards a connection when a broken-connection error is detected.
- The second actually discusses the exact use case it's being put to here. If something goes wrong between the ping and the first query, that is literally no different from something going wrong between any of the queries, and is outside the scope of this PR.
> maxlifetime solves the issue (of a connection being terminated by the db due to idle time) by forcing the driver to not reuse a connection after it's been open for longer than the maxlifetime.

No it doesn't. For it to accomplish this goal, you would have to set the `maxlifetime` value to less than the gather interval, which means every gather would get a new connection, defeating the purpose.

> I'm not interested in writing retry logic for an issue this PR already has 2 built-in solutions for. Yes, a bunch of shit can go wrong at any time, none of which was being handled prior to this PR anyway.

No, this PR is introducing new problems. I am trying to raise attention to those problems so that we don't introduce new bugs and have people reporting issues because things aren't working right.

> Feel free to write your own PR implementing whatever crazy retry logic you want, but quite frankly, if your postgresql setup needs that kind of logic you probably have bigger issues.

An idle timer and a network interruption are not crazy. DBs have idle timers, network firewalls have idle timers, people have poor connections, servers have maintenance; any number of things can cause the connection to be interrupted. Retry logic is something that has to be added to any application communicating with another app.
In regards to the ping, I will quote specific things from those links:

go-sql-driver/mysql#82 (comment):

> Don't ping nor check the db connection before executing queries, it's racey.

go-sql-driver/mysql#82 (comment):

> The key is to understand that a DB isn't a connection. When you sql.Open() you get a DB. It manages a pool of connections in the background, and doesn't open any until you need them. Doing a Ping() is a fine way to do an initial check if you want, but it's not all that Go-ish.

go-sql-driver/mysql#82 (comment):

> db.Ping() was introduced to give you an option to check the DSN since sql.Open(..) doesn't really open a connection: https://code.google.com/p/go/issues/detail?id=4804
> Moreover it can be used to check if the server is still alive. So essentially @vadimtk is right. In many cases you "open" the DB right after you application starts and want to make sure you connection data is right. That's where db.Ping() is useful for.

> what nil error from Ping() means is that "there is at least one connection to the database which we don't yet know to be bad, or a new connection was successfully established". I don't see where this information would be valuable except perhaps right after sql.Open().
Force-pushed from 8f46c9f to ee2b082
I've removed the documentation around max_open/max_idle (I agree they could have been confusing) and added a ping check at the start of each gather to make sure a working connection is available.
Just briefly looked over this; will include in 1.4. Let me preface this by saying I haven't used the sql package before. Doesn't the sql package open a new connection automatically if one is closed cleanly? In what cases would we need to retry or wait until the next Gather loop?
It does not. It passes the error to the client and lets the client handle it. This is because the connection might have been closed for a number of reasons that the client might need to handle, or log (so that admins are aware their connections are unexpectedly closing). Pretty much any time we get a connection error we would need to immediately retry. Waiting until the next gather would not be a good idea for a few reasons.
Well, that makes things tricky; I wish it were more like HTTP keepalive. Now I'm being really lazy, but does Ping() make a network call or just toss closed connections?
Network call.
So in that case we shouldn't use it to ensure a connection.
You couldn't even if you wanted to. Aside from issues like the latency it introduces, and the remote chance that something can happen between the ping and the query, the big issue is that there's no guarantee the ping will happen on the same connection as the query.
To answer your question: yes, on a clean close it'll happily create a new connection. A bad close complicates matters, which the ping will resolve by making the network call (if we actually care; I don't, I only added it because phemmer seemed concerned about things that in reality are not a huge deal, or are easily solved by maxlifetime/TCP keepalives). A lot of phemmer's comments are, in my opinion, making mountains out of molehills. Here is the code for getting a connection:
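From memory of the Go 1.8/1.9-era source, trimmed and paraphrased (not verbatim):

```go
// Trimmed paraphrase of database/sql.(*DB).conn: when reusing a cached
// connection, an expired one is closed and driver.ErrBadConn is returned
// so the caller's retry loop tries again with a fresh connection.
if strategy == cachedOrNewConn && len(db.freeConn) > 0 {
	conn := db.freeConn[0]
	db.freeConn = db.freeConn[1:]
	if conn.expired(lifetime) { // lifetime == the SetConnMaxLifetime value
		db.mu.Unlock()
		conn.Close()
		return nil, driver.ErrBadConn
	}
	return conn, nil
}
```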
Okay, so let's go over the changes (ignoring ping; I don't really care for it in general anyway). This PR changes the code from:
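Roughly, a sketch of master's shape (reconstructed from the diff above, not the exact plugin code):

```go
// Reconstruction: master dialed and tore down a connection inside every
// single Gather call.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	db, err := sql.Open("pgx", p.Address)
	if err != nil {
		return err
	}
	defer db.Close()

	rows, err := db.Query(sql_query) // as in the diff above
	// ... accumulate fields from rows ...
	return err
}
```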
to:
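And roughly the new shape (again a reconstruction):

```go
// Reconstruction: with this PR, Start opens p.DB once and Gather reuses it.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	rows, err := p.DB.Query(sql_query) // as in the diff above
	// ... accumulate fields from rows ...
	return err
}
```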
When maxlifetime is set to less than the collection interval, these two are functionally equivalent: if a system runs into an issue, it would run into it with either one. All this bullshit about idle connections being closed by network infrastructure is resolved by TCP keepalives (which default to 10 minutes in the driver currently being used). The exception is idle connection shutdowns coming directly from postgresql (this was the original reason for invoking Ping, as it would likely detect this); those are configuration issues on the end user's side that we can do nothing about, but again are simply resolved by setting maxlifetime to an appropriate value if necessary.

However, given that I believe each of the methods on DB has a built-in retry with a fallback to a new connection, I'm fairly confident it won't need any tuning; see the code snippet below for QueryContext. If there is some idle connection termination in place on the telegraf user's system, it is their responsibility to configure it correctly. These are really not valid reasons to put retry-handling logic here; they are better handled in the drivers. The worst case scenario is that, to get the previous behaviour out of these changes, an end user has to configure maxlifetime to less than the collection interval. On the plus side, this will generally allow for longer-lasting connections to the DB, thanks to the driver's TCP keepalives and the fact that we are not forcibly reconnecting every time unless configured to do so.
The code begs to differ:
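From the Go 1.8/1.9-era database/sql source (approximate):

```go
// A query is retried up to twice against cached connections, then once
// more on a brand new connection, before the error is surfaced.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ { // maxBadConnRetries == 2
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}
```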
So long as we don't skip a gather because postgres closed an idle connection, we are good.
I'll run some tests this weekend to verify idle connection termination won't impact the gather stage.
No it will not. The only "clean close" is if the client closes the connection, in which case it will not be in the pool, and is irrelevant to the issue at hand. Obviously I'm unable to convince you, so here's some proof. I set up our firewall to drop idle connections after 15 seconds. I then configured telegraf to gather at 30 second intervals:
Here's the result of running telegraf with your PR:
Notice how it's behaving EXACTLY as I predicted it would. Every other gather fails. Please don't just dismiss my comments out of hand; I'm quite familiar with go & the sql library. I'm happy to adjust my telegraf config with whatever settings you would like me to test, but I guarantee nothing will fix it until a retry is put in the code.
@phemmer: and you didn't do the one thing I said would restore this code to the current behaviour in master: set max_lifetime to be less than your collection period. Here is an example:
Notice it reproduces your example exactly. Now watch what happens if I add in the max_lifetime (which causes the plugin to act the same as the code in master):
Notice how it's not erroring anymore. Long story short: at worst you get no benefit from this PR and just need to adjust your configuration; you would have needed to adjust your configuration regardless if your firewall has a short termination period. But as I've said:
I'm happy to do a follow-up PR once this is in to expose the TCP keepalive setting, so you can then get the full benefits of a persistent connection.
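For reference, the kind of setting being suggested (value illustrative; anything below the 15s idle cutoff works):

```toml
[[inputs.postgresql]]
  address = "host=localhost user=postgres sslmode=disable"
  ## shorter than the firewall's 15s idle cutoff (and the 30s interval),
  ## which restores master's connect-per-gather behaviour
  max_lifetime = "10s"
```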
I never did; I went and double/triple-checked database/sql and the pgx driver for possible issues. The fact of the matter is, if TCP keepalives or max lifetime didn't solve the issue, there would be something seriously wrong in either the driver or the sql package.
Why should I have to add a config param to fix an issue this PR introduces? Changes should be non-breaking. If
Why such an aversion to retries? A retry would fix the issue without any action from the user. It would also handle a plethora of other issues, such as random network interruptions, the db going down for maintenance, etc.
I imagine it is possible for the DB to configure the idle timeout on connections, perhaps via pgbouncer? I feel like the client shouldn't show any error if an idle connection is closed by the server. If it is set to less than the interval, that would happen each gather; is this correct? (My premise that idle connections should be allowed to be closed without effect is mostly based on my experience with HTTP keep-alive.)
Also, if we do retries it should really just be a retry. No exponential backoffs or anything :)
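Something this small would do (a sketch; gatherOnce is a hypothetical stand-in for the existing query loop):

```go
// Sketch of a single, non-backoff retry around the gather body.
func (p *Postgresql) Gather(acc telegraf.Accumulator) error {
	err := p.gatherOnce(acc) // gatherOnce is hypothetical shorthand
	if err != nil {
		// one immediate retry: a fresh attempt pulls a new connection
		// from the pool if the old one was killed while idle
		err = p.gatherOnce(acc)
	}
	return err
}
```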
Yup, but in reality, if you actually care about always being able to collect metrics, you're going to reserve the connection slot anyway, so this is a non-issue. If you don't care about always being able to collect metrics, then why are you continuously pushing for retries? =)

> Why such an aversion to retries?

Because they make the code harder to maintain, they're more code that needs to be debugged, they tend to burn CPU cycles/time attempting code that will never succeed except in extremely rare cases (like random and very short network interruptions), and most importantly they're just not needed in this case.

> It would also handle a plethora of other issues, such as random network interruptions

Yup, but current master is also impacted by those, so nothing new would be introduced; don't hold it against this PR. Even if it were a separate PR, I'd argue retries have a negative impact on actually collecting metrics each cycle. They also only help with extremely short network interruptions, on the order of milliseconds to seconds.

> the db going down for maintenance, etc.

If the db goes down for maintenance you can't collect metrics anyway, and guess what? Now you've just caused the gather to take much longer to complete (and likely still fail at the end anyway). Again, this is a non-issue. I'm really tired of this. =)
> I imagine it is possible for the DB to configure the idle timeout on connections, perhaps via pgbouncer?

pgbouncer does offer this: client_idle_timeout. They obviously mark it as dangerous. =) I also tried using pgbouncer to reproduce this and it didn't work, so I had to switch to tcpkill.

> I feel like the client shouldn't show any error if an idle connection is closed by the server.

If it's closed properly by the server, it shouldn't. If it's hard-killed, it might, but that would only happen in something like the firewall demonstration above. It shouldn't happen once gather actually starts.

> If it is set to less than the interval, that would happen each gather; is this correct?

If by 'it' you mean max_lifetime, and by 'that' you mean creating a new connection, then yes. =)
What about if
That is effectively the same as my examples using tcpkill above. But like I said, I couldn't get pgbouncer to terminate my idle connections within a 10s interval; I set client_idle_timeout to 1s and it kept my client connection around between cycles.
I should also mention the docs for this setting do encourage setting it properly:
If the connection is closed cleanly, it should not affect subsequent gathers. If the firewall switches to DROP, or someone digs up the cable with a backhoe, then I'm okay with missing the next gather.
Force-pushed from 50ec85d to 0d4491c
Force-pushed from 0d4491c to 5fb81b5
I'll fix this up again this weekend.
Force-pushed from 9a582a1 to ddd17d2
Just an aside: the connection termination issue that was a hot topic in this PR can be addressed in a future PR. I managed to get the changes necessary into database/sql to allow drivers to be more easily configured with connection details, such as a keepalive configuration. Just waiting for it to land in go1.10; I'll make a PR to the driver and telegraf once that happens.
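The go1.10 change referred to is sql.OpenDB plus the driver.Connector interface; a sketch of how a keepalive could then be wired in (the connector type here is hypothetical; assumes "context", "database/sql/driver", and "time" are imported):

```go
// Sketch (go1.10+): sql.OpenDB accepts a driver.Connector, letting dial
// options such as a TCP keepalive be configured before the pool opens
// anything. keepaliveConnector and drv are hypothetical stand-ins; a real
// driver would thread keepalive into its net.Dialer.
type keepaliveConnector struct {
	dsn       string
	keepalive time.Duration
	drv       driver.Driver
}

func (c keepaliveConnector) Connect(ctx context.Context) (driver.Conn, error) {
	return c.drv.Open(c.dsn) // a real implementation would apply c.keepalive here
}

func (c keepaliveConnector) Driver() driver.Driver { return c.drv }

// usage: db := sql.OpenDB(keepaliveConnector{dsn, 30 * time.Second, pgDriver})
```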
Force-pushed from ddd17d2 to 5eb57c0
@james-lawrence Sorry to leave this one sitting for so long, is this ready for review?
@danielnelson It's all good; yes, it's ready as far as I'm concerned. There are some further improvements I can make once I finish getting changes upstream.
New default configuration behaviour: note that pid, backend_start, and client_port do not change.
Setting max_idle to 0: note that all 3 change.