Note: this is what I learned at the second internet company I joined after graduating from university.


Disaster recovery: GitLab data recovery

Incident summary, 2022-05-07

What happened

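  • Time: around 20:30 on 2022-05-07

  • Symptom: the GitLab server failed to boot after a restart

  • Error message:

    reboot and select proper boot device or insert boot media in selected boot device and press a key

  • Diagnosis: the boot area of the disk had been damaged

  • Options: recover the data or reinstall the system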

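Recovery process

Attempting data recovery

  • I attached the system disk to another machine. The LVM commands showed no partitions on it; the disk looked as if it had been formatted. After some research I tried to recover it with TestDisk.

  • TestDisk

  • gitlab-disk-1.png

  • Even after the recovery, the filesystem still could not be mounted.

  • I tried several other approaches but still could not mount the files. By then it was about 03:00 on 2022-05-08.

  • I saved the disk image to study later and prepared to reinstall the system.

  • gitlab-disk-2.png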

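Reinstalling the system

Reinstalling the operating system

  • Install CentOS 8
  • Install the base software, download the Nexus and gitlab-ee images, and mount the backup disk
  • Copy the backup data to the host

Restoring the data

  • Nexus is backed up once a week, so it could only be restored to the backup taken in the early hours of May 1
  • GitLab is backed up once a week, so it could only be restored to the backup taken in the early hours of May 7
  • Jenkins had no backup and all of its data was lost. The important scripts had already been moved to GitLab beforehand and have since been located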

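Recovery results

  • Nexus has been restored and is running normally
  • GitLab was not fully restored: it is running, but some features are buggy because of dirty data. The code repositories were first brought back to the versions matching production; after that the runners and environment variables were re-added, the pipelines for each environment were made to pass again, and the release packages were uploaded to Nexus
  • Jenkins: since most CI has been migrated to GitLab, it does not need to be installed for now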

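Incident review

Root causes

  • A new disk was added to the GitLab server without first rehearsing the installation procedure, and the disk failed to load. The exact cause is still unknown and will be followed up.
  • The GitLab restore failed because the backup script had never been changed to back up the files that the EE documentation specifies. The old backup script was assumed to still be adequate.

After GitLab was upgraded to the EE version, the backup script stayed the same. The old script additionally backed up the .ssh directory and gitlab.rb, but after the upgrade the .ssh directory should have been replaced with gitlab-secrets.json. Because the script was not changed, the keys, environment variables, runners, repository configuration and so on were all lost.

Every backup run now prints the following warning, but nobody paid attention to it: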

Warning: Your gitlab.rb and gitlab-secrets.json files contain sensitive data
and are not included in this backup. You will need these files to restore a backup.
Please back them up manually.
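
The warning above asks for exactly two files. A minimal sketch of a helper that archives them next to the regular backup could look like the following. The function name, the archive naming scheme and the destination directory are my own assumptions and are not taken from the real backup script.

```shell
#!/usr/bin/env bash
# Sketch only: archive the two files that the GitLab backup task leaves out.
# The caller supplies both directories; the archive name is hypothetical.

backup_gitlab_config() {
    local config_dir="$1" dest_dir="$2"
    local stamp archive file
    stamp="$(date +%Y%m%d_%H%M%S)"
    archive="${dest_dir}/gitlab_config_${stamp}.tar.gz"

    # Refuse to write a partial archive: both files are needed for a restore.
    for file in gitlab.rb gitlab-secrets.json; do
        if [ ! -f "${config_dir}/${file}" ]; then
            echo "missing ${config_dir}/${file}" >&2
            return 1
        fi
    done

    mkdir -p "${dest_dir}" || return 1
    # umask 077 keeps the archive private, since it contains secrets.
    ( umask 077 && tar -czf "${archive}" -C "${config_dir}" \
        gitlab.rb gitlab-secrets.json ) || return 1

    # Print the archive path so the caller can copy it off the host.
    echo "${archive}"
}
```

On an Omnibus installation the configuration directory is normally /etc/gitlab, so a call might be `backup_gitlab_config /etc/gitlab /some/backup/dir`, with the second path chosen by whoever runs it. Failing when either file is missing is deliberate: a silent partial archive is how this incident went unnoticed.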

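Impact

GitLab and Nexus were unavailable for about one day. Some code was lost, some GitLab features do not work properly, and the environment variables, keys and similar secrets are permanently gone and have to be re-entered by hand. During the outage no new versions could be released to the test or production environments.

Responsible person

RuGod

  • The new disk was added to the system volume without being tested first
  • The backup script was not updated after the version upgrade, and the restore procedure for the new version was never rehearsed

Proposed action: formal warning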

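Follow-up improvements

  • Hardware additions must be tested on a test machine before they go onto our production servers
  • Software version upgrades must go through a test, backup and restore cycle, and the new version's documentation must be checked promptly to see whether the backup script needs to change
  • Take inventory of the services that currently have no backup and add reliable backup scripts for them
  • The current strategy is for servers to back each other up, so backup data exists only locally and on one other machine. The recommendation is to buy a dedicated backup server, keep local backups, and additionally copy at least the important data to a cloud platform (for example AWS S3 or Alibaba Cloud OSS)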
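
One cheap safeguard that fits the list above is a freshness check on the backup directory. The sketch below is only an illustration: the function name and the age threshold are assumptions, and the file pattern simply follows the `*_gitlab_backup.tar` naming seen in this incident.

```shell
#!/usr/bin/env bash
# Sketch only: succeed if the directory holds at least one GitLab backup
# archive modified within the last max_age_min minutes.

backup_is_fresh() {
    local dir="$1" max_age_min="$2"
    local newest
    # -mmin -N matches files modified less than N minutes ago.
    newest="$(find "${dir}" -maxdepth 1 -type f -name '*_gitlab_backup.tar' \
        -mmin "-${max_age_min}" -print 2>/dev/null | head -n 1)"
    [ -n "${newest}" ]
}
```

With weekly backups, a cron job could call something like `backup_is_fresh /path/to/backups 11520 || echo "GitLab backup is stale"` (11520 minutes is eight days; both the path and the threshold are placeholders) and send the message to whatever alerting channel is in use.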

Follow-up repairs

1. Start GitLab without mounting any data, then give the git user ownership of the backup file

root@gitlab:/# docker-compose up -d
root@gitlab:/# docker exec -it gitlab bash
root@gitlab:/var/opt/gitlab/backups# ll
total 9616216
drwx------. 2 git root 71 May 9 15:26 ./
drwxr-xr-x. 19 root root 4096 May 9 15:28 ../
-rw-r--r--. 1 root root 9846999040 May 9 15:27 1652099118_2022_05_09_14.9.2-ee_gitlab_backup.tar
root@gitlab:/var/opt/gitlab/backups# chown git:git /var/opt/gitlab/backups/1652099118_2022_05_09_14.9.2-ee_gitlab_backup.tar

2. Stop the services that connect to the database, then run the restore

gitlab-ctl stop puma
gitlab-ctl stop sidekiq
gitlab-backup restore BACKUP=1652099118_2022_05_09_14.9.2-ee
Transfering ownership of /var/opt/gitlab/gitlab-rails/shared/registry to git
Unpacking backup ... done
2022-05-09 15:33:55 +0000 -- Restoring database ...
2022-05-09 15:33:55 +0000 -- Be sure to stop Puma, Sidekiq, and any other process that
connects to the database before proceeding. For Omnibus
installs, see the following link for more information:
https://docs.gitlab.com/ee/raketasks/backup_restore.html#restore-for-omnibus-gitlab-installations

Before restoring the database, we will remove all existing
tables to avoid future upgrade problems. Be aware that if you have
custom tables in the GitLab database these tables and all data will be
removed.

Do you want to continue (yes/no)?yes
Removing all tables. Press `Ctrl-C` within 5 seconds to abort
2022-05-09 15:34:13 UTC -- Cleaning the database ...
2022-05-09 15:34:15 UTC -- done
Restoring PostgreSQL database gitlabhq_production ... ERROR: must be owner of extension pg_trgm
ERROR: must be owner of extension btree_gist
ERROR: must be owner of extension btree_gist
ERROR: must be owner of extension pg_trgm

# These errors do not affect the restore

2022-05-09 15:36:12 +0000 -- done
2022-05-09 15:36:12 +0000 -- Restoring uploads ...
2022-05-09 15:36:15 +0000 -- done
2022-05-09 15:36:15 +0000 -- Restoring builds ...
2022-05-09 15:36:15 +0000 -- done
2022-05-09 15:36:15 +0000 -- Restoring artifacts ...
2022-05-09 15:36:58 +0000 -- done
2022-05-09 15:36:58 +0000 -- Restoring pages ...
2022-05-09 15:36:58 +0000 -- done
2022-05-09 15:36:58 +0000 -- Restoring lfs objects ...
2022-05-09 15:36:58 +0000 -- done
2022-05-09 15:36:58 +0000 -- Restoring terraform states ...
2022-05-09 15:36:58 +0000 -- done
2022-05-09 15:36:58 +0000 -- Restoring container registry images ...
2022-05-09 15:36:58 +0000 -- done
2022-05-09 15:36:58 +0000 -- Restoring packages ...
2022-05-09 15:36:58 +0000 -- done
This task will now rebuild the authorized_keys file.
You will lose any data stored in the authorized_keys file.
Do you want to continue (yes/no)? yes

Deleting backups/tmp ... done
Warning: Your gitlab.rb and gitlab-secrets.json files contain sensitive data
and are not included in this backup. You will need to restore these files manually.
Restore task is done.
Transfering ownership of /var/opt/gitlab/gitlab-rails/shared/registry to registry
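
Note that the BACKUP= value in the transcript above is the archive file name without its `_gitlab_backup.tar` suffix. A small sketch of deriving it from a file path, with a function name that is purely my own:

```shell
#!/usr/bin/env bash
# Sketch only: turn a backup archive path into the value expected by
# "gitlab-backup restore BACKUP=...".

backup_id_from_file() {
    local name
    name="$(basename "$1")"
    case "${name}" in
        *_gitlab_backup.tar)
            # Strip the fixed suffix; what remains is the backup ID.
            printf '%s\n' "${name%_gitlab_backup.tar}"
            ;;
        *)
            echo "not a GitLab backup archive: ${name}" >&2
            return 1
            ;;
    esac
}
```

For the archive used here, `backup_id_from_file /var/opt/gitlab/backups/1652099118_2022_05_09_14.9.2-ee_gitlab_backup.tar` prints `1652099118_2022_05_09_14.9.2-ee`, which matches the value passed to the restore command.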


3. Verify the result after a successful restore

root@gitlab:/# gitlab-rake gitlab:doctor:secrets

I, [2022-05-09T15:38:53.075563 #4748] INFO -- : Checking encrypted values in the database
I, [2022-05-09T15:38:58.716706 #4748] INFO -- : - Ci::InstanceVariable failures: 0
I, [2022-05-09T15:38:58.743346 #4748] INFO -- : - Ci::PipelineScheduleVariable failures: 9
I, [2022-05-09T15:38:58.775164 #4748] INFO -- : - Ci::Variable failures: 0
I, [2022-05-09T15:38:58.798421 #4748] INFO -- : - Ci::GroupVariable failures: 0
I, [2022-05-09T15:38:58.808623 #4748] INFO -- : - Ci::JobVariable failures: 0
I, [2022-05-09T15:38:58.857484 #4748] INFO -- : - Ci::PipelineVariable failures: 35
I, [2022-05-09T15:38:58.867923 #4748] INFO -- : - ApplicationSetting failures: 0
I, [2022-05-09T15:38:59.385816 #4748] INFO -- : - User failures: 1
I, [2022-05-09T15:38:59.397883 #4748] INFO -- : - GeoNode failures: 0
I, [2022-05-09T15:38:59.413367 #4748] INFO -- : - Clusters::Platforms::Kubernetes failures: 1
I, [2022-05-09T15:38:59.443323 #4748] INFO -- : - Snippet failures: 0
I, [2022-05-09T15:38:59.444312 #4748] INFO -- : - PersonalSnippet failures: 0
I, [2022-05-09T15:38:59.445102 #4748] INFO -- : - ProjectSnippet failures: 0
I, [2022-05-09T15:38:59.455251 #4748] INFO -- : - Clusters::Applications::Helm failures: 0
I, [2022-05-09T15:38:59.465617 #4748] INFO -- : - Clusters::Applications::Prometheus failures: 0
rake aborted!
OpenSSL::Cipher::CipherError:
/opt/gitlab/embedded/service/gitlab-rails/app/models/integration.rb:405:in `copy_properties_to_encrypted_properties'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `block in write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:112:in `block in read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:172:in `retry_with_backoff'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:110:in `read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:125:in `write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `method_missing'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/doctor/secrets.rb:38:in `block in check_model_attributes'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/doctor/secrets.rb:36:in `each'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/doctor/secrets.rb:36:in `check_model_attributes'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/doctor/secrets.rb:26:in `run!'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/doctor/secrets.rake:11:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Tasks: TOP => gitlab:doctor:secrets
(See full trace by running task with --trace)

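The task aborts because the encrypted properties of the integrations cannot be decrypted without the original secrets.

Fixing the integrations problem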

root@gitlab:/# gitlab-rails c
--------------------------------------------------------------------------------
Ruby: ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
GitLab: 14.9.2-ee (3034418fb31) EE
GitLab Shell: 13.24.0
PostgreSQL: 12.7
------------------------------------------------------------[ booted in 19.15s ]
Loading production environment (Rails 6.1.4.6)
irb(main):001:0> Integration.update_all(encrypted_properties: nil)
=> 17
irb(main):002:0> exit

root@gitlab:/# gitlab-rake gitlab:doctor:secrets
I, [2022-05-09T15:41:48.848028 #5101] INFO -- : Checking encrypted values in the database
I, [2022-05-09T15:41:54.626427 #5101] INFO -- : - Ci::InstanceVariable failures: 0
I, [2022-05-09T15:41:54.653169 #5101] INFO -- : - Ci::PipelineScheduleVariable failures: 9
I, [2022-05-09T15:41:54.685570 #5101] INFO -- : - Ci::Variable failures: 0
I, [2022-05-09T15:41:54.708865 #5101] INFO -- : - Ci::GroupVariable failures: 0
I, [2022-05-09T15:41:54.719064 #5101] INFO -- : - Ci::JobVariable failures: 0
I, [2022-05-09T15:41:54.767616 #5101] INFO -- : - Ci::PipelineVariable failures: 35
I, [2022-05-09T15:41:54.776018 #5101] INFO -- : - ApplicationSetting failures: 0
I, [2022-05-09T15:41:54.826346 #5101] INFO -- : - User failures: 1
I, [2022-05-09T15:41:54.838384 #5101] INFO -- : - GeoNode failures: 0
I, [2022-05-09T15:41:54.853359 #5101] INFO -- : - Clusters::Platforms::Kubernetes failures: 1
I, [2022-05-09T15:41:54.884088 #5101] INFO -- : - Snippet failures: 0
I, [2022-05-09T15:41:54.885233 #5101] INFO -- : - PersonalSnippet failures: 0
I, [2022-05-09T15:41:54.886105 #5101] INFO -- : - ProjectSnippet failures: 0
I, [2022-05-09T15:41:54.896200 #5101] INFO -- : - Clusters::Applications::Helm failures: 0
I, [2022-05-09T15:41:54.906413 #5101] INFO -- : - Clusters::Applications::Prometheus failures: 0
I, [2022-05-09T15:41:54.953231 #5101] INFO -- : - Integration failures: 0
I, [2022-05-09T15:41:54.955038 #5101] INFO -- : - Integrations::Github failures: 0
I, [2022-05-09T15:41:54.956736 #5101] INFO -- : - Integrations::Asana failures: 0
I, [2022-05-09T15:41:54.958873 #5101] INFO -- : - Integrations::Assembla failures: 0
I, [2022-05-09T15:41:54.960510 #5101] INFO -- : - Integrations::BaseCi failures: 0
I, [2022-05-09T15:41:54.961908 #5101] INFO -- : - Integrations::Bamboo failures: 0
I, [2022-05-09T15:41:54.963366 #5101] INFO -- : - Integrations::Buildkite failures: 0
I, [2022-05-09T15:41:54.964794 #5101] INFO -- : - Integrations::DroneCi failures: 0
I, [2022-05-09T15:41:54.966252 #5101] INFO -- : - Integrations::Jenkins failures: 0
I, [2022-05-09T15:41:54.967642 #5101] INFO -- : - Integrations::Teamcity failures: 0
I, [2022-05-09T15:41:54.969132 #5101] INFO -- : - Integrations::MockCi failures: 0
I, [2022-05-09T15:41:54.970669 #5101] INFO -- : - Integrations::BaseIssueTracker failures: 0
I, [2022-05-09T15:41:54.972130 #5101] INFO -- : - Integrations::Bugzilla failures: 0
I, [2022-05-09T15:41:54.973615 #5101] INFO -- : - Integrations::CustomIssueTracker failures: 0
I, [2022-05-09T15:41:54.975018 #5101] INFO -- : - Integrations::Ewm failures: 0
I, [2022-05-09T15:41:54.976444 #5101] INFO -- : - Integrations::Jira failures: 0
I, [2022-05-09T15:41:54.977818 #5101] INFO -- : - Integrations::Redmine failures: 0
I, [2022-05-09T15:41:54.979229 #5101] INFO -- : - Integrations::Youtrack failures: 0
I, [2022-05-09T15:41:54.980613 #5101] INFO -- : - Integrations::Campfire failures: 0
I, [2022-05-09T15:41:54.981974 #5101] INFO -- : - Integrations::Confluence failures: 0
I, [2022-05-09T15:41:54.983358 #5101] INFO -- : - Integrations::Datadog failures: 0
I, [2022-05-09T15:41:54.984828 #5101] INFO -- : - Integrations::BaseChatNotification failures: 0
I, [2022-05-09T15:41:54.986256 #5101] INFO -- : - Integrations::Discord failures: 0
I, [2022-05-09T15:41:54.987580 #5101] INFO -- : - Integrations::HangoutsChat failures: 0
I, [2022-05-09T15:41:54.988881 #5101] INFO -- : - Integrations::Mattermost failures: 0
I, [2022-05-09T15:41:54.990183 #5101] INFO -- : - Integrations::MicrosoftTeams failures: 0
I, [2022-05-09T15:41:54.991490 #5101] INFO -- : - Integrations::Slack failures: 0
I, [2022-05-09T15:41:54.992790 #5101] INFO -- : - Integrations::UnifyCircuit failures: 0
I, [2022-05-09T15:41:54.994111 #5101] INFO -- : - Integrations::WebexTeams failures: 0
I, [2022-05-09T15:41:54.995420 #5101] INFO -- : - Integrations::EmailsOnPush failures: 0
I, [2022-05-09T15:41:54.996709 #5101] INFO -- : - Integrations::ExternalWiki failures: 0
I, [2022-05-09T15:41:54.998020 #5101] INFO -- : - Integrations::Flowdock failures: 0
I, [2022-05-09T15:41:54.999316 #5101] INFO -- : - Integrations::Harbor failures: 0
I, [2022-05-09T15:41:55.000656 #5101] INFO -- : - Integrations::Irker failures: 0
I, [2022-05-09T15:41:55.002045 #5101] INFO -- : - Integrations::BaseSlashCommands failures: 0
I, [2022-05-09T15:41:55.003407 #5101] INFO -- : - Integrations::MattermostSlashCommands failures: 0
I, [2022-05-09T15:41:55.004785 #5101] INFO -- : - Integrations::SlackSlashCommands failures: 0
I, [2022-05-09T15:41:55.006087 #5101] INFO -- : - Integrations::Packagist failures: 0
I, [2022-05-09T15:41:55.007377 #5101] INFO -- : - Integrations::PipelinesEmail failures: 0
I, [2022-05-09T15:41:55.008745 #5101] INFO -- : - Integrations::Pivotaltracker failures: 0
I, [2022-05-09T15:41:55.012251 #5101] INFO -- : - Integrations::BaseMonitoring failures: 0
I, [2022-05-09T15:41:55.015523 #5101] INFO -- : - Integrations::Prometheus failures: 0
I, [2022-05-09T15:41:55.016855 #5101] INFO -- : - Integrations::MockMonitoring failures: 0
I, [2022-05-09T15:41:55.018164 #5101] INFO -- : - Integrations::Pushover failures: 0
I, [2022-05-09T15:41:55.019516 #5101] INFO -- : - Integrations::Zentao failures: 0
I, [2022-05-09T15:41:55.020881 #5101] INFO -- : - Integrations::Shimo failures: 0
I, [2022-05-09T15:41:55.022222 #5101] INFO -- : - Integrations::GitlabSlackApplication failures: 0
I, [2022-05-09T15:41:55.032708 #5101] INFO -- : - AlertManagement::HttpIntegration failures: 0
I, [2022-05-09T15:41:55.042969 #5101] INFO -- : - GrafanaIntegration failures: 0
I, [2022-05-09T15:41:55.053062 #5101] INFO -- : - JiraConnectInstallation failures: 0
I, [2022-05-09T15:41:55.063690 #5101] INFO -- : - PagesDomain failures: 0
I, [2022-05-09T15:41:55.073923 #5101] INFO -- : - PagesDomainAcmeOrder failures: 0
I, [2022-05-09T15:41:55.083946 #5101] INFO -- : - ProjectImportData failures: 0
I, [2022-05-09T15:41:55.093869 #5101] INFO -- : - RemoteMirror failures: 0
I, [2022-05-09T15:41:55.127068 #5101] INFO -- : - WebHook failures: 0
I, [2022-05-09T15:41:55.128418 #5101] INFO -- : - ProjectHook failures: 0
I, [2022-05-09T15:41:55.129777 #5101] INFO -- : - ServiceHook failures: 0
I, [2022-05-09T15:41:55.131082 #5101] INFO -- : - SystemHook failures: 0
I, [2022-05-09T15:41:55.132371 #5101] INFO -- : - GroupHook failures: 0
I, [2022-05-09T15:41:55.142613 #5101] INFO -- : - Alerting::ProjectAlertingSetting failures: 0
I, [2022-05-09T15:41:55.152839 #5101] INFO -- : - Atlassian::Identity failures: 0
I, [2022-05-09T15:41:55.163829 #5101] INFO -- : - BulkImports::Configuration failures: 0
I, [2022-05-09T15:41:55.174150 #5101] INFO -- : - Clusters::KubernetesNamespace failures: 0
I, [2022-05-09T15:41:55.184138 #5101] INFO -- : - ErrorTracking::ProjectErrorTrackingSetting failures: 0
I, [2022-05-09T15:41:55.194162 #5101] INFO -- : - IncidentManagement::ProjectIncidentManagementSetting failures: 0
I, [2022-05-09T15:41:55.204361 #5101] INFO -- : - Integrations::IssueTrackerData failures: 0
I, [2022-05-09T15:41:55.214521 #5101] INFO -- : - Integrations::JiraTrackerData failures: 0
I, [2022-05-09T15:41:55.224600 #5101] INFO -- : - Integrations::ZentaoTrackerData failures: 0
I, [2022-05-09T15:41:55.234745 #5101] INFO -- : - Serverless::DomainCluster failures: 0
I, [2022-05-09T15:41:55.235690 #5101] INFO -- : - Clusters::Integrations::Prometheus failures: 0
I, [2022-05-09T15:41:55.245732 #5101] INFO -- : - Dast::SiteProfileSecretVariable failures: 0
I, [2022-05-09T15:41:55.255860 #5101] INFO -- : - StatusPage::ProjectSetting failures: 0
I, [2022-05-09T15:41:55.267199 #5101] INFO -- : - Clusters::Providers::Aws failures: 0
I, [2022-05-09T15:41:55.277610 #5101] INFO -- : - Clusters::Providers::Gcp failures: 0
I, [2022-05-09T15:41:55.287740 #5101] INFO -- : - Packages::Debian::GroupDistributionKey failures: 0
I, [2022-05-09T15:41:55.297773 #5101] INFO -- : - Packages::Debian::ProjectDistributionKey failures: 0
I, [2022-05-09T15:41:55.299025 #5101] INFO -- : - Gitlab::BackgroundMigration::BackfillJiraTrackerDeploymentType2::JiraTrackerDataTemp failures: 0
I, [2022-05-09T15:41:55.320413 #5101] INFO -- : - Ci::Runner failures: 0
I, [2022-05-09T15:41:58.204407 #5101] INFO -- : - Ci::Build failures: 0
I, [2022-05-09T15:41:58.586279 #5101] INFO -- : - Group failures: 0
I, [2022-05-09T15:42:01.604851 #5101] INFO -- : - Project failures: 0
I, [2022-05-09T15:42:01.621301 #5101] INFO -- : - DeployToken failures: 10
I, [2022-05-09T15:42:01.632694 #5101] INFO -- : - Clusters::AgentToken failures: 0
I, [2022-05-09T15:42:01.643991 #5101] INFO -- : - ScimOauthAccessToken failures: 0
I, [2022-05-09T15:42:01.655934 #5101] INFO -- : - Operations::FeatureFlagsClient failures: 9
I, [2022-05-09T15:42:01.656019 #5101] INFO -- : Total: 65 row(s) affected
I, [2022-05-09T15:42:01.656030 #5101] INFO -- : Done!
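
The report above runs to more than a hundred lines, and only a handful matter. As a sketch, a small filter can reduce it to the models that still have failures. The function name is hypothetical, and it assumes the log format shown here, where each model line ends in `failures: N`.

```shell
#!/usr/bin/env bash
# Sketch only: read gitlab:doctor:secrets output on stdin and print
# "Model count" for every model with at least one failure.

failing_models() {
    # Keep lines ending in "failures: N" with N > 0.
    # $NF is the count and $(NF-2) is the model name.
    awk '/ failures: [0-9]+$/ && $NF > 0 { print $(NF-2), $NF }'
}
```

It could be used as `gitlab-rake gitlab:doctor:secrets 2>&1 | failing_models`. For the second run above this would list Ci::PipelineScheduleVariable, Ci::PipelineVariable, User, Clusters::Platforms::Kubernetes, DeployToken and Operations::FeatureFlagsClient.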

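Fixing the secrets: clear the data in every table that reported failures

For details, see:

https://docs.gitlab.com/ee/raketasks/backup_restore.html#reset-cicd-variables

I went through all of the recovery steps in the official documentation, but the errors persisted. After a long search I found that the documentation suggests no action for the tables that were still failing.

So the documentation does not cover this case. I searched the web extensively and found that it is a new bug since 14.9 with no documented recovery procedure, which meant I had to handle it myself.

The approach is to inspect the table structure of GitLab's PostgreSQL database (read the official documentation to learn what each table is for). Some of the data cannot simply be deleted and has to be fixed with UPDATE statements. I am not publishing those here, since they took me a lot of time to write. I am only providing the DELETE statements for tables whose data can be removed without ill effect.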

root@gitlab:/# gitlab-rails dbconsole --database main
psql (12.7)
Type "help" for help.

gitlabhq_production=> delete from deploy_tokens;
DELETE 10
gitlabhq_production=> delete from operations_feature_flags_clients;
DELETE 10
gitlabhq_production=> delete from cluster_platforms_kubernetes;
DELETE 1
gitlabhq_production=> delete from ci_pipeline_schedule_variables;
DELETE 13
gitlabhq_production=> delete from ci_pipeline_variables;
DELETE 35
gitlabhq_production=> exit

root@gitlab:/# gitlab-rake gitlab:doctor:secrets
I, [2022-05-09T15:45:04.769158 #5466] INFO -- : Total: 0 row(s) affected
I, [2022-05-09T15:45:04.769168 #5466] INFO -- : Done!

# Further checks
root@gitlab:/# gitlab-rake gitlab:artifacts:check

root@gitlab:/# gitlab-rake gitlab:lfs:check
Checking integrity of LFS objects
- 1..3: Failures: 0
Done!

root@gitlab:/# gitlab-rake gitlab:uploads:check
Checking integrity of Uploads
- 11..218: Failures: 0
- 219..419: Failures: 0
- 420..620: Failures: 0
- 621..823: Failures: 0
- 824..1026: Failures: 0
- 1027..1221: Failures: 0
Done!

root@gitlab:/# gitlab-rake gitlab:check SANITIZE=true

Verification complete.

GitLab returning 500 errors

$ docker-compose restart

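After running for a while, GitLab frequently returned HTTP 500 errors.

The logs showed:

writing value to /dev/shm/gitlab/sidekiq/gauge_all_sidekiq_0-1.db failed with unmapped file

A web search suggested that /dev/shm had been given too little space: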

$ docker exec -it gitlab df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                  2.8T  1.1T  1.7T  39% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                     32G     0   32G   0% /sys/fs/cgroup
shm                       64M   64M   64M 100% /dev/shm
/dev/mapper/gitlab-root  2.8T  1.1T  1.7T  39% /etc/gitlab

The fix is to add shm_size: '256m' to the compose file used to start the container:

$ cat docker-compose.yaml
version: '3.6'
services:
  gitlab:
    image: 'gitlab/gitlab-ee:14.9.2-ee.0'
    restart: always
    hostname: 'gitlab'
    container_name: gitlab
    network_mode: bridge
    privileged: true
    user: root
    ports:
      - '80:80'
      - '443:443'
      - '2222:22'
      - '4567:4567'
    volumes:
      - './gitlab/config:/etc/gitlab'
      - './gitlab/logs:/var/log/gitlab'
      - './gitlab/data:/var/opt/gitlab'
    shm_size: '256m'

After another restart the problem no longer occurred.
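
To notice this kind of exhaustion before it turns into 500 errors, the usage of /dev/shm can be read from df output. The following is a rough sketch rather than part of the original fix; the function name is an assumption, and it expects the POSIX `df -P` layout, in which the last column is the mount point and the one before it is the usage percentage.

```shell
#!/usr/bin/env bash
# Sketch only: read "df -P" output on stdin and print the usage percentage
# (without the % sign) of the given mount point.

mount_use_percent() {
    local mountpoint="$1"
    # Skip the header line, match on the last column, strip the trailing "%".
    awk -v m="${mountpoint}" \
        'NR > 1 && $NF == m { sub(/%$/, "", $(NF-1)); print $(NF-1) }'
}
```

A possible use is `docker exec gitlab df -P | mount_use_percent /dev/shm`, comparing the result against a threshold of your choosing.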

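Summary

With that, the GitLab recovery was complete. In hindsight, had the backup script been updated in time, even losing a day of data would have been acceptable: everyone keeps the latest code locally, so the next day we could have asked the last committer of each project to push again and been back to normal, without such serious consequences. Troubleshooting the GitLab problems took me almost a full day. This is an incident I will certainly remember.

I promptly admitted the mistake to my manager. To my surprise, he told me not to take it to heart and called it a small problem. I still felt bad about it, but he said that everyone makes mistakes, himself included, and that what we can do is take steps to reduce them, such as rehearsing hardware changes and running backup and restore drills. I found that very encouraging.

As an aside, it was fortunate that Jenkins had already been migrated to GitLab and that Harbor had been split off from GitLab; otherwise I would not even have known what hit me. It was also fortunate that the online and offline servers had already been deployed separately; otherwise the GitLab outage would have affected far more.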
