Tags · finos/htc-grid

v0.4.0

Merge pull request #67 from fgogolli/v040_changes

EKS Cluster & Nodes:
- Change to using [terraform-aws-modules/eks](https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest) for managing and deploying the EKS Cluster as well as related resources, such as: Node IAM Roles & Policies, Node Defaults incl. instance types, Security Groups and the AWS Auth ConfigMap.
- Change to using [EKS Managed Node Groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html) for all of the Core and Worker Node Groups.
- Configure [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) to manage the scaling and lifecycle of the EKS Managed Node Groups.
- Disable AWS Node Termination Handler, as it shouldn't be used in conjunction with EKS Managed Node Groups.
- Simplify and standardise VPC Endpoint creation. Add EKS Private VPC Endpoint to allow internal communications from the private subnet with the EKS Control Plane.
- Change node taints from `grid/type: Operator` to `htc/node-type: core` and `htc/node-type: worker`. Add those as labels and tags as well, to simplify operations and cluster visibility via kubectl and other monitoring solutions.
- Adjust default instance types for the Core and Worker Node Groups to allow for better diversification and deployment, both for OnDemand and Spot workloads.
- Change to using `cluster_name` instead of `eks_cluster_id` everywhere, in line with the new module changes.
- Add ability to specify EBS Volume type and size for the EKS Nodes.
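
As an illustration of the new taint and label scheme, here is a minimal sketch of the pod scheduling fields needed to target one of the node types. The key and values come from the notes above. The `NoSchedule` effect and the helper name `node_scheduling` are assumptions made for this example, not something the release states.

```python
import json

# Key and values are taken from the release notes.
NODE_TYPE_KEY = "htc/node-type"
# Assumption: the release notes do not state the taint effect.
ASSUMED_TAINT_EFFECT = "NoSchedule"


def node_scheduling(node_type: str) -> dict:
    """Return pod spec fields that select and tolerate the given node type.

    Hypothetical helper for illustration only.
    """
    if node_type not in ("core", "worker"):
        raise ValueError(f"unknown node type: {node_type}")
    return {
        # Matches the label added alongside the taint.
        "nodeSelector": {NODE_TYPE_KEY: node_type},
        # Lets the pod be scheduled despite the taint on those nodes.
        "tolerations": [
            {
                "key": NODE_TYPE_KEY,
                "operator": "Equal",
                "value": node_type,
                "effect": ASSUMED_TAINT_EFFECT,
            }
        ],
    }


if __name__ == "__main__":
    # Print the fragment as JSON so it can be merged into a pod spec.
    print(json.dumps(node_scheduling("core"), indent=2))
```

The same label can also be used to filter nodes, for example with a `htc/node-type=core` label selector in kubectl.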

EKS AddOns:
- Change to [eks-blueprints-addons](https://registry.terraform.io/modules/aws-ia/eks-blueprints-addons/aws/latest) for managing and deploying all of the EKS Blueprint AddOns and OSS Helm Releases, such as: CoreDNS, Kube-Proxy, VPC CNI, FluentBit, Cluster Autoscaler, AWS LoadBalancer Controller, CloudWatch Metrics, KEDA, InfluxDB, Prometheus & Grafana, as well as **all** the relevant configuration.
- Add implicit and explicit dependencies to fix the race conditions where the `AWS LoadBalancer Controller` may get deleted before being able to clean up the AWS resources that it manages. The new dependency order guarantees a proper clean up of those resources before the `AWS LoadBalancer Controller` is destroyed during unprovisioning.
- Fix the explicit and implicit dependencies between the Kubernetes data sources and the underlying resources created by the `EKS Blueprints Addons` module.
- Move ingress and dashboard creation for Grafana to be handled via the Helm chart and clean up the un-needed additional Terraform resources. Add the Grafana Ingress URL as a Terraform output for the module.
- Adjust image and repo configuration to pull the correct version for `Cluster Autoscaler` and other components.
- Adjust the node selectors for FluentBit and CloudWatch agent DaemonSets to deploy to all nodes.
- Switch to using the new Go based high-performance FluentBit logger for CloudWatch.
- Disable Grafana Live Server (as it requires WebSockets).
- Add cookie based session stickiness to the Grafana ingress to allow the ALB Controller and the Grafana HA deployment to handle auth properly.
- Fix FluentBit based Container Insights Logs.
- Extend the CoreDNS creation timeout to 25 minutes to allow the control plane to self-heal in case of issues.

HTC-Grid:
- Change to using [eks-blueprints-addon](https://registry.terraform.io/modules/aws-ia/eks-blueprints-addon/aws/latest) for deploying the HTC-Grid Helm Chart as well as creating the respective [IRSA Role](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html).
- Adjust IAM Policies & Permissions (ensuring CloudWatch Log Group lifecycle handling is done via Terraform), as well as formatting and naming to ensure consistency for all the Lambdas.
- Split the Control Plane Lambda definitions into individual TF files, simplifying configuration and improving the visibility and grouping of the resources created.

Terraform & Helm:
- Adjust all of the Terraform Registry modules to use `~>` version pinning, allowing any new non-major versions to be used (any minor and patch updates are allowed), simplifying dependency version updates and ensuring consistency.
- Upgrade all of the Terraform modules from the Terraform Registry to use the **current latest** versions.
- Upgrade all of the Terraform providers to use the latest available versions and major version pinning using the `~>` operator.
- Upgrade all of the Helm charts and container images to the current latest version for all of the components.
- Remove image-level pinning of Helm AddOn components, pinning only via the Helm release versions.
- Remove un-needed explicit `depends_on` statements, which caused slowness, cyclic dependencies or failures on plan (by not allowing data sources to be computed before an apply).
- Fix cyclic dependency and remove the need for running targeted applies for the IAM Policies for the EKS Pull Through Cache and Agent permissions in the `apply`/`auto-apply` stages.
- Move to using `aws_api_gateway_rest_api_policy` instead of a direct policy attachment of a generic policy for `OpenAPI Private`, which showed changes on every `terraform apply`, due to the wildcard allow policy.
- Configure the AWS CloudWatch Metrics and AWS for FluentBit deployments to run on the `Core` nodes.
- Configure Grafana to start two replicas and spread them across different nodes for high availability.
- Clean up the Helm chart `values.yaml` files, removing any unneeded or unrequired config and simplifying the deployments. Consolidate Helm chart versions into a single variable for ease of change and visibility.
- Remove un-needed data sources and use module outputs as required to also enforce consistent implicit dependencies in Terraform.
- Simplify and consolidate the variable definitions, usage and functions across all of the resources and modules.
- Adjust output and variable descriptions, types and values to reflect the required information and ensure consistency.
- Adjust provider configurations to ensure correct credential retrieval and handling.
- Use `aws_htc_ecr` consistently across all of the Helm charts as the ECR source repository for pulling internal and pull-through images.
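
To make the `~>` pinning mentioned above concrete, here is a minimal sketch of how Terraform's pessimistic constraint operator bounds versions: only the rightmost component given in the constraint may increase. The function names are hypothetical, and the sketch handles plain numeric versions only (no pre-release suffixes).

```python
def parse(version: str) -> tuple:
    """Split a dotted numeric version such as "1.2.3" into (1, 2, 3)."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a `~> X.Y` or `~> X.Y.Z` style constraint."""
    base = parse(constraint.replace("~>", "", 1))
    if len(base) < 2:
        raise ValueError("this sketch expects at least two components")
    candidate = parse(version)
    # Pad the candidate with zeros so that "1.2" compares like "1.2.0".
    candidate += (0,) * (len(base) - len(candidate))
    # Upper bound: drop the last component and bump the one before it,
    # so `~> 1.2` stops below 2 and `~> 1.2.3` stops below 1.3.
    upper = base[:-2] + (base[-2] + 1,)
    return base <= candidate and candidate[: len(upper)] < upper


if __name__ == "__main__":
    for v in ("1.2.0", "1.9.4", "2.0.0"):
        print(v, satisfies_pessimistic(v, "~> 1.2"))
```

With a two-component constraint such as `~> 1.2`, any later minor or patch release within major version 1 is accepted, which matches the "non-major versions" behaviour described in the notes.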

New Features:
- Upgrade `ElastiCache` to version 7 and start using the ***AWS Graviton3*** based `cache.r7g.large` instance(s) for the Redis cluster.
- Add ability to do in-place upgrades of the `ElastiCache` clusters by versioning the `Parameter Groups` created/used.
- Add `watch_htc.sh` script, which can be used to monitor the status of a Kubernetes job running tasks on HTC-Grid, as well as the status of the overall compute plane, including the HPA, Deployment, Nodes and Job Completion statuses as well as durations. The script takes two arguments, namely the namespace to be watched and the name of the Kubernetes job.
- Add support for correct handling of the `AWS Partition` as well as `AWS Partition DNS Suffix`.
- Add ability to automatically manage the lifecycle of the self-signed ALB Certificates via the deployment process (any certs about to expire will get automatically updated and rolled out without any downtime).
- Migrate to using `AWS Certificate Manager` instead of the `IAM Server Certificates` for the ALB Certs.
- Increase the self-signed ALB Cert validity to 1 year, with auto-renew if the deployment is run within 6 months of the expiration time.
- Add ability to automatically create, update and destroy an `admin` Cognito user via the deployment, to be used for the Grafana authentication, reducing the need for manual steps during the setup as well as the workshop.
- Add user cleanup on `destroy` for the `admin` Cognito user (created for use with Grafana) as well as the relevant Cognito config with the Grafana Ingress.
- Switch to creating the Cognito User for Grafana using TF native resources.
- Switch the `grafana_admin_password` variable to be sensitive everywhere.
- Add template file and generation for submitting a batch of multi-session tasks instead of copying/replacing at runtime of the workshop. Adjust docs/workshop accordingly.
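
The certificate renewal rule described above (1 year validity, renewal when a deployment runs within 6 months of expiry) can be sketched as a simple date check. This is only an illustration of the rule, not the project's implementation: the function name is hypothetical, and "6 months" is approximated as 182 days.

```python
from datetime import datetime, timedelta, timezone

# "1 year" from the release notes, expressed in days.
VALIDITY = timedelta(days=365)
# Assumption: "6 months" approximated as 182 days for this sketch.
RENEW_WINDOW = timedelta(days=182)


def needs_renewal(issued_at: datetime, now: datetime) -> bool:
    """Return True if a deployment run at `now` should replace the cert."""
    expires_at = issued_at + VALIDITY
    # Renew once the current time falls inside the window before expiry.
    return now >= expires_at - RENEW_WINDOW


if __name__ == "__main__":
    issued = datetime(2023, 1, 1, tzinfo=timezone.utc)
    for current in (
        datetime(2023, 3, 1, tzinfo=timezone.utc),
        datetime(2023, 8, 1, tzinfo=timezone.utc),
    ):
        print(current.date(), needs_renewal(issued, current))
```

Because the check only happens when a deployment runs, a certificate is refreshed only if some deployment is executed during the renewal window.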

Lambda Runtimes:
- Unify all of the `lambda_runtimes` into a single Dockerfile, driving behavior via build time arguments.
- Add package updates at build time (incl. cache clearing post updates), to ensure latest versions of updates are always included in the runtime images.
- Migrate all build runtimes to use the ECR Pull Through Cache for the build images.
- Simplify and consolidate the Lambda runtime build and push Terraform resources into a single map of resources.
- Fix Lambda Runtimes Dockerfile to handle different entrypoint source script for the provided runtime.

ECR & Image Builds:
- Change all container images to use the ECR Pull Through Cache where possible.
- Add a new pull-through-cache config for `registry.k8s.io`, to allow for pulling any cluster components automatically, i.e. the `cluster-autoscaler`.
- Add flag (`REBUILD_RUNTIMES`) which allows re-creating the local images for all the runtimes (without using the cache) and pushing them to ECR.
- Clean up `image_repository`, keeping the minimum number of required external dependencies (those not available via an ECR Pull Through Cache) to be manually copied over to the local ECR repositories.
- Add the ability to clean up the ECR Pull Through Cache repositories upon running `destroy-images`.
- Add image scanning on push/upload for all of the ECR Repositories.
- Move to using `for_each` instead of `count` for ECR Repositories ensuring they don't get destroyed from a simple order change in the JSON Config.
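
The reason `for_each` is safer than `count` for the ECR repositories can be sketched by modelling how each one addresses resource instances: `count` keys them by list position, `for_each` by a stable string key. The resource label `this` and the repository names below are hypothetical, chosen only for illustration.

```python
def addresses_with_count(names):
    """Model `count`: instances are addressed by their list index."""
    return {f"aws_ecr_repository.this[{i}]": n for i, n in enumerate(names)}


def addresses_with_for_each(names):
    """Model `for_each`: instances are addressed by their own key."""
    return {f'aws_ecr_repository.this["{n}"]': n for n in names}


def changed(before, after):
    """List addresses whose bound repository differs between two configs."""
    all_addresses = before.keys() | after.keys()
    return sorted(a for a in all_addresses if before.get(a) != after.get(a))


if __name__ == "__main__":
    # Same repositories, with the first two swapped in the config.
    old = ["lambda", "agent", "submitter"]
    new = ["agent", "lambda", "submitter"]
    print(changed(addresses_with_count(old), addresses_with_count(new)))
    print(changed(addresses_with_for_each(old), addresses_with_for_each(new)))
```

With index-based addressing, the swapped entries appear as changed instances, which Terraform would plan to replace. With key-based addressing, a pure reordering produces no changes.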

Cloud9:
- Fix all of the Cloud9 bootstrap errors, including the handling of different packages and the correct installation and upgrade of all the components, and improve the bootstrap logging to increase visibility into the success or failure of the Cloud9 deployment.
- Update default versions for all pre-requisites for the Cloud9 environment to the latest versions.
- Add support for using `main` (i.e. downloading the current HEAD version of the repo) as a value for `HTCGridVersion` when deploying the Cloud9 environment.

Docs:
- Adjust workshop texts, screenshots and configs to reflect the latest changes introduced as part of this or previous PRs and give instructions on any possible deploy time issues and how to fix them.
- Add instructions on how to use the `watch-htc.sh` script for monitoring jobs and deployments.
- Add the quick one-command based option for disabling of Cloud9 Managed Temporary Credentials.
- Adjust wording, correct grammar mistakes and other typos and simplify language.
- Extend workshop cleanup steps to handle local state cleaning as well.

Misc.:
- Add `CHANGELOG.md` to the repository, reflecting all of the previous releases and commits.
- Format all of the deployment files to ensure consistency in naming, spacing, newlines, etc.
- Adjust wording, correct grammar mistakes and other typos across comments and other texts.
- Clean up old and unused files, charts, configs and commented out code.
- Fix the clean stage in the `init_grid` Makefile.
- Add `load_variables.sh` to `.gitignore`.
- Update all Copyright notices to reflect the current year (2023).

Sep 12, 2023
62afc75

v0.3.6

build(deps-dev): bump word-wrap in /deployment/image_repository/cdk

Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Jul 19, 2023
edc6e7a

v0.3.5

Merge pull request #38 from ruecarlo/main

fixed issue in cloud9 environment

Feb 28, 2022
d863ca3

v0.3.4

Merge pull request #37 from ruecarlo/fix-quantlib-example

Fixing entry in quantlib example

Feb 28, 2022
c4edd33

v0.3.3

fix: python example for through pull cache

Feb 25, 2022
310301d

v0.3.2

Merge pull request #35 from ruecarlo/main

ECR Pull through fixes

Feb 24, 2022
ebeaa81

v0.3.1

fix: migrating to version 0.3.1

Sep 15, 2021
9f1af13

