80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
DevOps, Cloud, Big Data, NoSQL, Python & Linux tools. All programs have --help
Hari Sekhon
Cloud & Big Data Contractor, United Kingdom
(you’re welcome to connect with me on LinkedIn)
Make sure you run make update if updating and not just git pull, as you will often need the latest library submodule and possibly new upstream libraries.
All programs and their pre-compiled dependencies can be found ready to run on DockerHub.
List all programs:
docker run harisekhon/pytools
Run any given program:
docker run harisekhon/pytools <program> <args>
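For example, to list the tags of a couple of DockerHub repos using one of the tools described further down (the repo names here are just illustrative arguments):
docker run harisekhon/pytools dockerhub_show_tags.py centos ubuntu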
The following bootstrap script installs git and make, pulls the repo and builds the dependencies:
curl -L https://git.io/python-bootstrap | sh
or manually:
git clone https://github.com/HariSekhon/DevOps-Python-tools pytools
cd pytools
make
To only install the pip dependencies for a single script, you can just type make followed by the script filename with a .pyc extension instead of .py:
make anonymize.pyc
Make sure to read Detailed Build Instructions further down for more information.
Some Hadoop tools require Jython; see the Jython for Hadoop Utils section for details.
All programs come with a --help switch which includes a program description and the list of command line options.
Environment variables are supported for convenience and also to hide credentials from being exposed in the process list, eg. $PASSWORD, $TRAVIS_TOKEN. These are indicated in the --help descriptions in brackets next to each option and often have more specific overrides with higher precedence, eg. $AMBARI_HOST and $HBASE_HOST take priority over $HOST.
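For example, a Travis CI token could be supplied via the environment rather than on the command line (a sketch only - which variables and arguments each program actually reads is listed in its --help):
export TRAVIS_TOKEN=mytoken    # kept out of the process list, unlike a command line argument
./travis_last_log.py HariSekhon/DevOps-Python-tools    # repo argument format is an assumption - check --help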
- anonymize.py - anonymizes your configs / logs from files or stdin (for pasting to Apache Jira tickets or mailing lists)
  - anonymize_custom.conf - put regex of your Name/Company/Project/Database/Tables to anonymize to <custom> (eg. <fqdn>, <password>, <custom>)
  - --ip-prefix leaves the last IP octet to aid in cluster debugging to still see differentiated nodes
  - --hash-hostnames - hashes hostnames to look like Docker temporary container ID hostnames so that vendors support...
- anonymize_parallel.sh - splits files in to multiple parts and runs anonymize.py on each part in parallel... output gets an .anonymized suffix. Preserves order of evaluation
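A minimal sketch of a typical invocation, reading from stdin and using the two switches described above (file names are illustrative and other anonymization switches from --help may be needed):
./anonymize.py --ip-prefix --hash-hostnames < myapp.log > myapp.log.anonymized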
- find_duplicate_files.py - finds duplicate files in one or more directory trees via multiple methods including file...
- find_active_server.py - finds fastest responding healthy server or active master in high availability deployments, ...
- welcome.py - cool spinning welcome message greeting your username and showing last login time and user, to put in .profile (there is also a perl version in my DevOps Perl Tools repo)
- aws_users_access_key_age.py - lists all users' access keys, status, date of creation and age in days. Optionally...
- aws_users_unused_access_keys.py - lists users' access keys that haven't been used in the last N days or that have...
- aws_users_last_used.py - lists all users and their days since last use across both passwords and access keys
- aws_users_pw_last_used.py - lists all users and dates since their passwords were last used. Optionally filters for...
- gcp_service_account_credential_keys.py - lists all GCP service account credential keys for a given project with...
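These presumably pick up credentials in the standard way for AWS client libraries (eg. ~/.aws/credentials or the usual AWS_* environment variables), so a hedged run is simply:
./aws_users_access_key_age.py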
- docker_registry_show_tags.py / dockerhub_show_tags.py / quay_show_tags.py - shows tags for docker repos in a... -q / --quiet to return only...
- docker_pull_all_tags.sh
- dockerhub_search.py - search DockerHub with a configurable number of returned results (older official docker search was limited to only 25 results), using --verbose will also show you how many results were returned. -q / --quiet to return only the image names for easy...
- docker_pull_all_images.sh - can be chained with dockerhub_show_tags.py to download all tagged versions for all...
- docker_pull_all_images_all_tags.sh
- dockerfiles_check_git*.py - check Git tags & branches align with the containing Dockerfile's ARG *_VERSION
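For example, combining the repo arguments shown in the SSL troubleshooting section further down with the -q / --quiet switch described above (assuming quiet mode prints one tag per line for easy piping):
./dockerhub_show_tags.py -q centos ubuntu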
- spark_avro_to_parquet.py - PySpark Avro => Parquet converter
- spark_parquet_to_avro.py - PySpark Parquet => Avro converter
- spark_csv_to_avro.py - PySpark CSV => Avro converter, supports both inferred and explicit schemas
- spark_csv_to_parquet.py - PySpark CSV => Parquet converter, supports both inferred and explicit schemas
- spark_json_to_avro.py - PySpark JSON => Avro converter
- spark_json_to_parquet.py - PySpark JSON => Parquet converter
- xml_to_json.py - XML to JSON converter
- json_to_xml.py - JSON to XML converter
- json_to_yaml.py - JSON to YAML converter
- json_docs_to_bulk_multiline.py - converts json files to bulk multi-record one-line-per-json-document format for...
- yaml_to_json.py - YAML to JSON converter (because some APIs like GitLab CI Validation API require JSON)
- see validate_*.py further down for all these formats and more
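As an illustrative sketch, converting this repo's Travis CI config (the file argument is an assumption - these converters may also read stdin, see --help):
./yaml_to_json.py .travis.yml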
- ambari_blueprints.py - Blueprint cluster templating and deployment tool using Ambari API
  - see the ambari_blueprints/ directory for a variety of Ambari blueprint templates generated by and deployable...
- ambari_ams_*.sh - query the Ambari Metrics Collector API for a given metric, list all metrics or hosts
- ambari_cancel_all_requests.sh - cancel all ongoing operations using the Ambari API
- ambari_trigger_service_checks.py - trigger service checks using the Ambari API

Hadoop HDFS:
- hdfs_find_replication_factor_1.py - finds HDFS files with replication factor 1, optionally resetting them to...
- hdfs_time_block_reads.jy - HDFS per-block read timing debugger with datanode and rack locations for a given file...
- hdfs_files_native_checksums.jy - fetches native HDFS checksums for quicker file comparisons (about 100x faster than hdfs dfs -cat | md5sum)
- hdfs_files_stats.jy - fetches HDFS file stats. Useful to generate a list of all files in a directory tree...
- hive_schemas_csv.py / impala_schemas_csv.py - dumps all databases, tables, columns and types out in CSV format

The following programs can all optionally filter by database / table name regex:

- hive_foreach_table.py / impala_foreach_table.py - execute any query or statement against every Hive / Impala table...
- hive_tables_row_counts.py / impala_tables_row_counts.py - outputs tables row counts. Useful for reconciliation...
- hive_tables_column_counts.py / impala_tables_column_counts.py - outputs tables column counts. Useful for...
- hive_tables_row_column_counts.py / impala_tables_row_column_counts.py - outputs tables row and column counts
- hive_tables_row_counts_any_nulls.py / impala_tables_row_counts_any_nulls.py - outputs tables row counts where...
- hive_tables_null_columns.py / impala_tables_null_columns.py - outputs tables columns containing only NULLs
- hive_tables_null_rows.py / impala_tables_null_rows.py - outputs tables row counts where all fields contain...
- hive_tables_metadata.py / impala_tables_metadata.py - outputs for each table the matching regex metadata DDL...
- hive_tables_locations.py / impala_tables_locations.py - outputs for each table its data location
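A hedged sketch of the foreach idea - the option name and the {db} / {table} placeholders here are assumptions for illustration only, not the documented interface (see --help for the real options):
./hive_foreach_table.py --query 'SELECT COUNT(*) FROM {db}.{table}'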
- hbase_generate_data.py - inserts randomly generated data into a given HBase table,...
- hbase_show_table_region_ranges.py - dumps HBase table region ranges information, useful when pre-splitting...
- hbase_table_region_row_distribution.py - calculates the distribution of rows across regions in an HBase table,...
- hbase_table_row_key_distribution.py - calculates the distribution of row keys by configurable prefix length in...
- hbase_compact_tables.py - compacts HBase tables (for off-peak compactions). Defaults to finding and iterating...
- hbase_flush_tables.py - flushes HBase tables. Defaults to finding and iterating on all tables or takes an...
- hbase_regions_by_*size.py - queries given RegionServers JMX to list topN regions by storeFileSize or...
- hbase_region_requests.py - calculates requests per second per region across all given RegionServers or average...
- hbase_regionserver_requests.py - calculates requests per second per regionserver across all given regionservers or...
- hbase_regions_least_used.py - finds topN biggest/smallest regions across given RegionServers that have received...
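For example, assuming these follow the $HBASE_HOST convention from the environment variables section above (the hostname is illustrative):
HBASE_HOST=hbase-master1 ./hbase_compact_tables.py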
- opentsdb_import_metric_distribution.py - calculates metric distribution in bulk import file(s) to find data skew
- opentsdb_list_metrics*.sh - lists OpenTSDB metric names, tagk or tagv via OpenTSDB API or directly from HBase
- pig-text-to-elasticsearch.pig - bulk index unstructured files in Hadoop to Elasticsearch
- pig-text-to-solr.pig - bulk index unstructured files in Hadoop to Solr
- pig_udfs.jy - Pig Jython UDFs for Hadoop
- find_active_server.py - returns first available healthy server or active master in high availability deployments, ... --host option but for...
- find_active_hadoop_namenode.py - returns active Hadoop Namenode in HDFS HA
- find_active_hadoop_resource_manager.py - returns active Hadoop Resource Manager in Yarn HA
- find_active_hbase_master.py - returns active HBase Master in HBase HA
- find_active_hbase_thrift.py - returns first available HBase Thrift Server (run...
- find_active_hbase_stargate.py - returns first available HBase Stargate rest server
- find_active_apache_drill.py - returns first available Apache Drill node
- find_active_cassandra.py - returns first available Apache Cassandra node
- find_active_impala*.py - returns first available Impala node of either Impalad,...
- find_active_presto_coordinator.py - returns first available Presto Coordinator
- find_active_kubernetes_api.py - returns first available Kubernetes API server
- find_active_oozie.py - returns first active Oozie server
- find_active_solrcloud.py - returns first available Solr / SolrCloud node
- find_active_elasticsearch.py - returns first available Elasticsearch node
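A sketch of the intended usage, assuming candidate hosts can be passed as arguments (the --host option mentioned above is the other route - check --help; hostnames are illustrative):
./find_active_hadoop_namenode.py namenode1 namenode2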
- travis_last_log.py - fetches Travis CI latest running / completed / failed build log for given repo -...
- travis_debug_session.py - launches a Travis CI interactive debug build session via Travis API, tracks...
- selenium_hub_browser_test.py - checks Selenium Grid Hub / Selenoid is working by calling browsers such as...
- validate_*.py - validate files, directory trees and/or standard input streams (.avro, .csv, json, parquet, .ini/.properties, .ldif, .xml, .yml/.yaml)

The automated build will use ‘sudo’ to install required Python PyPI libraries to the system unless running as root or it
detects being inside a VirtualEnv. If you want to install some of the common Python libraries using your OS packages
instead of installing from PyPI then follow the Manual Build section below.
Enter the pytools directory and run git submodule init and git submodule update to fetch my library repo:
git clone https://github.com/HariSekhon/DevOps-Python-tools pytools
cd pytools
git submodule init
git submodule update
sudo pip install -r requirements.txt
Download the DevOps Python Tools and Pylib git repos as zip files:
https://github.com/HariSekhon/DevOps-Python-tools/archive/master.zip
https://github.com/HariSekhon/pylib/archive/master.zip
Unzip both and move Pylib to the pylib folder under DevOps Python Tools.
unzip devops-python-tools-master.zip
unzip pylib-master.zip
mv -v devops-python-tools-master pytools
mv -v pylib-master pylib
mv -vf pylib pytools/
Proceed to install PyPI modules for whichever programs you want to use using your usual procedure - usually an internal
mirror or proxy server to PyPI, or rpms / debs (some libraries are packaged by Linux distributions).
All PyPI modules are listed in the requirements.txt and pylib/requirements.txt files.
Internal Mirror example (JFrog Artifactory or similar):
sudo pip install --index-url https://host.domain.com/api/pypi/repo/simple --trusted-host host.domain.com -r requirements.txt
Proxy example:
sudo pip install --proxy hari:mypassword@proxy-host:8080 -r requirements.txt
The automated build also works on Mac OS X but you’ll need to install Apple XCode (on recent Macs just typing git is enough to trigger the Xcode install).
I also recommend you get HomeBrew to install other useful tools and libraries you may need like OpenSSL for
development headers and tools such as wget (these are installed automatically if Homebrew is detected on Mac OS X):
bash-tools/install/install_homebrew.sh
brew install openssl wget
If failing to build an OpenSSL lib dependency, just prefix the build command like so:
sudo OPENSSL_INCLUDE=/usr/local/opt/openssl/include OPENSSL_LIB=/usr/local/opt/openssl/lib ...
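For example, to re-run the default build target with those variables set (assuming Homebrew's default OpenSSL prefix):
sudo OPENSSL_INCLUDE=/usr/local/opt/openssl/include OPENSSL_LIB=/usr/local/opt/openssl/lib make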
You may get errors trying to install to Python library paths even as root on newer versions of Mac. Sometimes this is caused by pip 10 vs pip 9, and downgrading will work around it:
sudo pip install --upgrade pip==9.0.1
make
sudo pip install --upgrade pip
make
The 3 Hadoop utility programs listed below require Jython (as well as Hadoop to be installed and correctly configured)
hdfs_time_block_reads.jy
hdfs_files_native_checksums.jy
hdfs_files_stats.jy
Run like so:
jython -J-cp $(hadoop classpath) hdfs_time_block_reads.jy --help
The -J-cp $(hadoop classpath) part dynamically inserts the current Hadoop Java classpath required to use the Hadoop APIs.
See below for the procedure to install Jython if you don't already have it.
This will download and install jython to /opt/jython-2.7.0:
make jython
Jython is a simple download and unpack and can be fetched from http://www.jython.org/downloads.html
Then add the Jython install bin directory to the $PATH or specify the full path to the jython binary, eg:
/opt/jython-2.7.0/bin/jython hdfs_time_block_reads.jy ...
Strict validation of hosts / domains / FQDNs uses TLDs which are populated from the official IANA list; this is done via my PyLib library submodule - see there for details on configuring this to permit custom TLDs like .local, .intranet, .vm, .cloud etc. (all already included in there because they're common across companies' internal environments).
If you end up with an error like:
./dockerhub_show_tags.py centos ubuntu
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:765)
It can be caused by an issue with the underlying Python + libraries due to changes in OpenSSL and certificates. One
quick fix is to do the following:
sudo pip uninstall -y certifi &&
sudo pip install certifi==2015.04.28
Run:
make update
This will git pull and then git submodule update which is necessary to pick up corresponding library updates.
If you update often and want to just quickly git pull + submodule update but skip rebuilding all those dependencies each time, then run make update-no-recompile (it will miss new library dependencies - do a full make update if you encounter issues).
Continuous Integration is run on this repo with tests for success and failure scenarios:
To trigger all tests run:
make test
which will start with the underlying libraries, then move on to top level integration tests and functional tests using
docker containers if docker is available.
Patches, improvements and even general feedback are welcome in the form of GitHub pull requests and issue tickets.
You might also be interested in the following really nice Jupyter notebook for HDFS space analysis created by another
Hortonworks guy Jonas Straub:
https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb
The rest of my original source repos are here.
Pre-built Docker images are available on my DockerHub.