Commit graph

58 commits

Author SHA1 Message Date
dferstay
a35f4960a8 Make Worker.Shutdown() synchronous (#58)
Previously, a WaitGroup was used to track executing ShardConsumers
and prevent Worker.Shutdown() from returning until all ShardConsumers
had completed.  Unfortunately, it was possible for Shutdown() to race
with the eventLoop(), leading to a situation where Worker.Shutdown()
returns while a ShardConsumer is still executing.

Now, we increment the WaitGroup to keep track of the eventLoop() as well
as the ShardConsumers.  This prevents shutdown from returning until all
background goroutines have completed.

Signed-off-by: Daniel Ferstay <dferstay@splunk.com>
2021-12-20 21:21:15 -06:00
Aurélien Rainone
6c9e594751 Make the lease refresh period configurable (#56)
* Add LeaseRefreshSpanMillis in configuration

For certain use cases of KCL, the hard-coded value of 5s,
representing the time span before the end of a lease timeout in
which the current owner gets to renew its own lease, is not
sufficient. When the time taken by ProcessRecords is higher
than 5s, the lease gets lost and the shard may end up with another
worker.

This commit adds a new configuration value, which defaults to 5s,
to let users set this value according to their own needs.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Slight code simplification

Or readability improvement

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>
2021-12-20 21:21:15 -06:00
Tao Jiang
9ca9d901ca Fix error in publishing cloud watch metrics (#55)
Reported at:
https://github.com/vmware/vmware-go-kcl/issues/54

The input params were not used to set up the monitoring service in the
cloudwatch Init function. The empty appName, streamName and workerID
caused PutMetricData to fail with the error string "Error in publishing
cloudwatch metrics. Error: InvalidParameter...".

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:15 -06:00
Aurélien Rainone
c9793728a3 Fix 'get records time' metric (#53)
The time sent to `metrics.MonitoringService.RecordGetRecordsTime`
was not the time taken by `GetRecords`; it was the time taken by
`GetRecords` and `ProcessRecords` added together.

Fixes #51

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>
2021-12-20 21:21:15 -06:00
Tao Jiang
eb56e3b1d7 Fix broken tests (#50)
Fix some broken unit and integ tests introduced by last commit.

Tests:
1. hmake test
2. Run integration test on Goland IDE and make sure all pass.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:15 -06:00
Aurélien Rainone
21980a54e3 Expose monitoring service (#49)
* Remove MonitoringConfiguration and export no-op service

MonitoringConfiguration is not needed anymore as the user directly
implements its monitoring service or uses one of the default constructors.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Provide a constructor for CloudWatchMonitoringService

Unexport all fields

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Provide a constructor to PrometheusMonitoringService

Unexport fields

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Remove all CloudWatch-specific stuff from config package

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* NewWorker accepts a metrics.MonitoringService

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Fix tests

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Add WithMonitoringService to config

Instead of having an additional parameter to NewWorker so that the
user can provide its own MonitoringService, WithMonitoringService
is added to the configuration. This is much cleaner and remains
in-line with the rest of the current API.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Fix tests after introduction of WithMonitoringService

Also, fix tests that should have been fixed in earlier commits.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Move Prometheus into its own package

Also rename it to prometheus.MonitoringService to not have to repeat
Prometheus twice when using.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Move CloudWatch metrics into its own package

Also rename it to cloudwatch.MonitoringService to not have to repeat
Cloudwatch twice when using.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>

* Remove references to Cloudwatch in comments

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>
2021-12-20 21:21:15 -06:00
Tao Jiang
d6369e48c2 Fix the broken integration test (#48)
https://github.com/vmware/vmware-go-kcl/pull/47 moved zap
into its own package, but it also broke the integration test.
This change fixes the integ test by correcting its package
reference.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:15 -06:00
Aurélien Rainone
8a8f9e6339 logger: move zap into its own package (#47)
Since #27, vmware-go-kcl has supported any logger interface,
which is very nice.

However, because `logger/zap.go` directly imports zap, zap became
a dependency of whoever uses `vmware-go-kcl`. The
problem is that zap also has many dependencies.

To spare KCL users the cost of a feature they don't
need, the zap stuff has been moved to a `logger/zap` sub-package.

Fixes #45

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>
2021-12-20 21:21:15 -06:00
Tao Jiang
971d748195 Fix missing init position with AT_TIMESTAMP (#44)
AT_TIMESTAMP starts from the record at or after the specified
server-side Timestamp. However, the implementation was
missing. The bug was not noticed until recently because most
users never use this feature.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:15 -06:00
Tao Jiang
0d91fbd443 Add generic logger support (#43)
* Add generic logger support

The current KCL is tightly coupled with logrus, which makes it
difficult for customers to use a different logging system such as zap.
The issue has been opened via:
https://github.com/vmware/vmware-go-kcl/issues/27

This change creates a logger interface that abstracts over
logrus and zap. It makes it easy to add support for other
logging systems in the future. The work is based on:
https://www.mountedthoughts.com/golang-logger-interface/

Some updates are made in order to make logging system easily
injectable and add more unit tests.

Tested against real Kinesis and DynamoDB as well.

Signed-off-by: Tao Jiang <taoj@vmware.com>

* Add lumberjack configuration options to have fine grained control

Update the file log configuration by adding most of the lumberjack
configuration options to avoid hard-coded default values. Let the user
specify the values, because log retention and rotation are very important
for prod environments.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Aurélien Rainone
c8a5aa1891 Fix possible deadlock with getRecords in eventLoop (#42)
A waitgroup should always be incremented before the creation of the
goroutine which decrements it (through Done) or there is the
potential for deadlock.
That was not the case since the wg.Add was performed after the
`go getRecords() ` line.

Also, since there's only one path leading to the wg.Done in getRecords,
I moved wg.Done out of the getRecords function and placed it
alongside the goroutine creation, thus totally removing the need to
pass the waitgroup pointer to the sc instance. This led to the
removal of the `waitGroup` field from the `ShardConsumer` struct.

This has been tested in production and didn't create any problem.

Signed-off-by: Aurélien Rainone <aurelien.rainone@gmail.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
4f79203f44 Get rid of unused skipTableCheck (#39) 2021-12-20 21:21:14 -06:00
Tao Jiang
46fea317de Release shard lease after shutdown (#31)
* Release shard lease after shutdown

Currently, only the locally cached shard info is removed when a worker loses
the lease. The info inside the checkpointer (DynamoDB) is not removed. As a
result, the lease is held until it expires, and it might take too long
before the shard is ready for another worker to grab. This change releases
the lease in the checkpointer immediately.

The user needs to ensure appropriate checkpointing before returning from
the Shutdown callback.

Test:
  Updated the unit and integration tests to ensure that only the shard owner
is wiped out and the checkpoint information is left intact.

Signed-off-by: Tao Jiang <taoj@vmware.com>

* Add code coverage reporting

Add code coverage reporting for unit test.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
ac8d341cb1 Revert "Remove shard info in checkpointer (#29)" (#30)
This reverts commit 7e382e90d5d9eb30ed38cc1ab452336860f48b57.
2021-12-20 21:21:14 -06:00
Tao Jiang
8369884952 Remove shard info in checkpointer (#29)
Currently, only the locally cached shard info is removed when a worker loses
the lease. The info inside the checkpointer (DynamoDB) is not removed. As a
result, the lease is held until it expires, and it might take too long
before the shard is ready for another worker to grab. This change releases
the lease in the checkpointer immediately.

The user needs to ensure appropriate checkpointing before returning from
the Shutdown callback.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
fa0bbc42fe Update worker to let it inject checkpointer and kinesis (#28)
* Update worker to let it inject checkpointer and kinesis

Add two functions to inject a checkpointer and Kinesis client, either for a
custom implementation or for adding mocks in unit tests.

This change also removes worker_custom.go since it is no longer
needed.

Test:
  Update the integration tests to cover newly added functions.

Signed-off-by: Tao Jiang <taoj@vmware.com>

* Fix typo on the test function

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
250bb2e9ff Use AWS built-in retry logic and refactor tests (#24)
Update the unit test and move integration test under test folder.
Update retry logic by switching to AWS's default retry.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
6df520b343 Remove signal handling from event loop (#20)
Take signal handling out of the event loop. Also, make the worker
Shutdown idempotent and update tests.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
2ca82c25ca Add support for providing custom checkpointer (#17)
* Add credential configuration for resources

Add credentials for Kinesis, DynamoDB and Cloudwatch. See the worker_test.go
to see how to use it.

Signed-off-by: Tao Jiang <taoj@vmware.com>

* Add support for providing custom checkpointer

Provide a new constructor for adding a checkpointer instead of always using
the default DynamoDB checkpointer.

The next step is to abstract Kinesis into a generic stream API. That
will be a bigger change and will be addressed in a different PR.

Test:
  Use the new constructor to inject the DynamoDB checkpointer and run the
  existing tests.

Signed-off-by: Tao Jiang <taoj@vmware.com>

* Add support for providing custom checkpointer

Provide a new constructor for adding a checkpointer instead of always using
the default DynamoDB checkpointer.

The next step is to abstract Kinesis into a generic stream API. That
will be a bigger change and will be addressed in a different PR.

Fix checkfmt error.

Test:
  Use the new constructor to inject the DynamoDB checkpointer and run the
  existing tests.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
c634c75ebc Add credential configuration for resources (#14)
Add credentials for Kinesis, DynamoDB and Cloudwatch. See the worker_test.go
to see how to use it.

Signed-off-by: Tao Jiang <taoj@vmware.com>
2021-12-20 21:21:14 -06:00
Tao Jiang
5140058e8b Update dependency steps
Make sure to update dependency packages before test.
2021-12-20 21:21:12 -06:00
Tao Jiang
13aa9632cd Upgrade to use go1.11 and switch to use go mod
1. No functional change just upgrade to go1.11.
2. Add go mod support.
3. Make vendored copy of dependencies

Test
1. hmake
2. run worker_test.go in GoLand IDE
2021-12-20 21:20:13 -06:00
Tim Studd
cd343cca09 Add configuration options for AWS service endpoints (#5)
* Add configuration options for AWS service endpoints

Signed-off-by: Timothy Studd <tim@goguardian.com>

* Fix KCL naming consistency issue

Signed-off-by: Timothy Studd <tim@goguardian.com>
2021-12-20 21:20:13 -06:00
Tao Jiang
03685b2b19 Fix type conversion error
Fix the compile issue of type conversion. int --> float64.
2021-12-20 21:20:13 -06:00
Tao Jiang
6a1a7b7da6 Fix the exponential backoff
Fix the calculation of exponential backoff. ^ is the XOR operator in
Go, not exponentiation. Replaced it with math.Exp2().
2021-12-20 21:20:13 -06:00
VMware GitHub Bot
2d5d506659 Add DCO text 2021-12-20 21:20:10 -06:00
VMware GitHub Bot
691082e284 Add DCO text 2021-12-20 21:19:26 -06:00
Tao Jiang
d6b5196b55 Update import path when switching github
Update import path in files when switching to github.
2021-12-20 21:19:26 -06:00
Tao Jiang
d13f8588a9 KCL: Update readme
Update the readme and contributing doc before publishing
to github repo.
https://github.com/vmware/vmware-go-kcl

Jira CNA-2036

Change-Id: Idd8cfd8c89d3202613ff1d3018a584945ad30e4a
2021-12-20 21:19:23 -06:00
Tao Jiang
10e8ebb3ff KCL: Fix KCL stops processing when Kinesis Internal Error
Currently, KCL doesn't release the shard when returning on error,
so the worker cannot get any new shard because it already holds
the maximum number of shards. This change makes sure the shard
is released on return.

Also update the log message.

Test:
Integration test by forcing error on reading shard to
simulate Kinesis Internal error and make sure the KCL
will not stop processing.

Jira CNA-1995

Change-Id: Iac91579634a5023ab5ed73c6af89e4ff1a9af564
2021-12-20 21:16:38 -06:00
Tao Jiang
3163d31f28 KCL: KCL should ignore deleted parent shard
A few days after a shard split, the parent shard will be
deleted by the Kinesis system. KCL should ignore the error caused
by the deleted parent shard and move on.

Test:
Manually split a shard on the kcl-test stream in the photon-infra account.
Currently, shard 3 is the parent shard of shards 4 and 5. Shard 3
has a parent shard 0 which has already been deleted. Verified that
the test runs and does not get stuck waiting for the parent shard.

Jira CNA-2089

Change-Id: I15ed0db70ff9836313c22ccabf934a2a69379248
2021-12-20 21:16:38 -06:00
Tao Jiang
9addbb57f0 KCL: Fix random number generator
Fix the random number generator by adding seed.
https://stackoverflow.com/questions/12321133/golang-random-number-generator-how-to-seed-properly

Jira CNA-1119

Change-Id: Idfe23d84f31a47dcf43c8025632ff6f115614d34
2021-12-20 21:16:38 -06:00
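A minimal sketch of time-based seeding, as 9addbb57f0 describes. The commit message does not show the actual code, so the helper newRand below is an assumption. Before Go 1.20, the global math/rand generator was deterministic unless it was seeded explicitly.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRand returns a generator seeded with the given value.
// Hypothetical helper: a fixed seed gives a repeatable sequence,
// a time-based seed gives a different sequence on each run.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	r := newRand(time.Now().UnixNano())
	fmt.Println(r.Intn(100)) // varies from run to run
}
```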
Tao Jiang
22de13ef8a Go-KCL: Update security scan
gas is now gosec. Need to update security scan and fix
security issue as needed.

No functional change.

Jira CNA-2022

Change-Id: I36f2a204114f3f13e2ed05579c04a9c89f528f9a
2021-12-20 21:16:38 -06:00
Tao Jiang
47daa9d5f0 KCL: Update copyright and permission
All source should be prepared in a manner that reflects
comments that VMware would be comfortable sharing with
the public.

Documentation only. No functional change.

Update the license to MIT to be consistent with approved
OSSTP product tracking ticket:
https://osstp.vmware.com/oss/#/upstreamcontrib/project/1101391

Jira CNA-1117

Change-Id: I3fe31f10db954887481e3b21ccd20ec8e39c5996
2021-12-20 21:16:27 -06:00
Tao Jiang
e2a945d824 KCL: Stuck on processing after kinesis shard splitting
Kinesis processing gets stuck after a shard split. The
reason is that the app doesn't perform the mandatory checkpoint.

KCL document states:
// When the value of {@link ShutdownInput#getShutdownReason()} is
// {@link com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason#TERMINATE} it is required that you
// checkpoint. Failure to do so will result in an IllegalArgumentException, and the KCL no longer making progress.

Also, fix shard leasing to prevent one host from taking more shards than
its configuration allows.

Jira CNA-1701

Change-Id: Icbdacaf347c7a67b5793647ad05ff93cca629741
2021-12-20 21:15:25 -06:00
Tao Jiang
48fd4dd51c KCL: remove panic in shard consumer
There might be various reasons for a shard iterator to
expire, such as not enough data in the shard, or processing
that takes more than 5 minutes, which means the shard
iterator is not refreshed often enough.

This change removes log.Fatal which causes panic.
A panic inside a goroutine will bring down the whole
app. Therefore, just log the error and exit the goroutine
instead.

Jira ID: CNA-1072

Change-Id: I34a8d9af7258f3ea75465e2245bbc25c2fafee35
2021-12-20 21:15:25 -06:00
Tao Jiang
3120d89ae8 KCL: remove unused item
Change-Id: I164a3e551331464a020e82ad305294e5a659ab6c
2021-12-20 21:15:25 -06:00
Tao Jiang
85c04db6b4 KCL: Fix the way in returning error
Fix a bug in the shard sync step that removes shard info.

Jira ID: CNA-612

Change-Id: Ibaf55fffa39b793abbfe3bd57999e5d37f82a52f
2021-12-20 21:15:25 -06:00
Long Zhou
2b9301cd47 Flatten directory structure
cascade-kinesis-client will be used as a submodule of other projects,
so it should not have "src/vmware.com/cascade-kinesis-client" in
its path. To build this project locally, please manually create
the parent folders.

Change-Id: I8844e6a0e32aae65b28496915d8507e9fb1058c6
2021-12-20 21:15:15 -06:00
Tao Jiang
2542ea1416 KCL: Remove lease entry in dynamoDB table when shard no longer exists
The lease entry in the DynamoDB table must be removed when a shard has been
removed by Kinesis. This happens after a shard split: the parent shard is
removed by Kinesis after its retention period (normally after 24 hours).

Change-Id: I70a5836436ac0698110085d46d9438fcaf539cd2
2021-12-20 21:13:21 -06:00
Tao Jiang
6384d89748 KCL: Organize the folder structure
Organize the folder structure to support being imported as a
submodule by other services.

Jira CNA-701

Change-Id: I1dda27934642bb8a7755df07dc4a5048449afc86
2021-12-20 21:13:21 -06:00
Tao Jiang
5ef4338a22 KCL: Update shard sync to remove not existed shard
Shards that no longer exist in Kinesis must be removed from the
shardStatus cache.

Change-Id: I09b4a4c3c6480b8300fa937e6073dcd578156b29
2021-12-20 21:13:21 -06:00
Tao Jiang
e1071abc80 KCL: Fix cloudwatch metrics
This change fixes cloudwatch metrics publishing by adding a long-running
goroutine that periodically publishes cloudwatch metrics.
Also, shut down metrics publishing when KCL is shut down.

Test:
Ran hmake test and verified via the AWS cloudwatch console that
cloudwatch metrics have been published.

Jira CNA-702

Change-Id: I78b347cd12939447b0daf93f51acf620d18e2f49
2021-12-20 21:13:21 -06:00
Tao Jiang
2fea884212 KCL: Enable Metrics
This change enables metrics reporting and fixes a few bugs in metrics reporting.
The current metrics reporting is quite limited. More metrics will be added
in the next CR.

Tested with both prometheus and cloudwatch.

Jira CNA-702

Change-Id: I678b3f8a372d83f7b8adc419133c14cd10884f61
2021-12-20 21:13:21 -06:00
Tao Jiang
9d1993547f KCL: Ignore Lint error on const
The Go linter doesn't like all-caps const names. Since KCL is mainly derived
from Amazon's KCL, we'd like the constants to have exactly the same names as
in Amazon's KCL. Therefore, skip the lint check.

Change-Id: Ib8a2f52a8f4b44d814eda264f62fdcd53cccc2a7
2021-12-20 21:13:21 -06:00
Tao Jiang
869a8e4275 KCL: Add support for handling shard split
Add support for handling child/parent shards. Before processing a
child shard, the consumer has to wait until the parent shard has
finished.

Change-Id: I8bbf104c22ae93409d856be9c6829988c1b2d7eb
2021-12-20 21:13:20 -06:00
Tao Jiang
c05bfb7ac8 KCL: Fixing checkpoint operation
This change fixes the bug of not finding the checkpoint when the process
restarts. It also adds the missing call to the record processor that notifies
it of the shard info and checkpoint when the application first starts.

Test:
Run hmake test and verify the log.

Change-Id: I4bdf21ac10c5ee988a0860c140991f7d05975541
2021-12-20 21:13:20 -06:00
Tao Jiang
a323d2fd51 KCL: Implement Worker
This implements the worker, which is the core part of KCL.
It has exactly the same interface as Amazon's KCL. Internally,
it uses code from GoKini in order to get the library
functional quickly.

This is a working version. The test code worker_test.go
shows how to use this library.

Dynamic resharding feature is out of the scope of M4.

Test:

1. A Kinesis stream named "kcl-test" has been created under photon-infra
account.
2. Download your AWS Credential from IAM user page.
3. Modify the worker_test.go to fill in your aws credential.
4. hmake test

Jira CNA-637

Change-Id: I886d255bab9adaf7a13bca11bfda51bedaacaaed
2021-12-20 21:13:20 -06:00
Tao Jiang
1969713863 KCL: Fix unit test
Fix a code bug by removing a cyclic dependency, and fix the unit test.

Test:
hmake test

Change-Id: Ib4d4ba416d0133542e6459459ddf43079ff53ab8
2021-12-20 21:13:20 -06:00
Tao Jiang
425daf70ce KCL: Implement Shard Lease (part 1/2)
This is the first part of implementing shard lease for Kinesis
Client library. It creates dynamoDB table for managing
Kinesis stream shard lease.

https://jira.eng.vmware.com/browse/CNA-636

Adjust error code value range.

Change-Id: I16565fa15332843101235fb14545ee69c2599f2f
2021-12-20 21:13:20 -06:00